Objective: build a classification model that predicts which clients are likely to default on their loan, and recommend to the bank the important features to consider when approving a loan.
Data Description

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)
JOB: The type of job that loan applicant has such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.)
(Milestone submission and full code follow -- additional code added at the end under "Annex")
Loan default represents a critical financial risk that must be thoroughly understood to minimise its impact. In this dataset alone, default occurs in approximately 20% of cases (19.95%) and amounts to 20,120,400 USD/EUR (currency not given), which represents 18.14% of the total amount loaned.
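The headline figures above (default rate and defaulted amount as a share of the total loaned) come from simple aggregations over BAD and LOAN. A minimal sketch, run here on a toy frame rather than the real hmeq.csv:

```python
import pandas as pd

# Toy stand-in for the HMEQ frame; in practice: dt = pd.read_csv("hmeq.csv")
dt = pd.DataFrame({"BAD":  [1, 0, 0, 0, 1],
                   "LOAN": [1000, 2000, 3000, 4000, 500]})

default_rate = dt["BAD"].mean() * 100                    # % of clients who defaulted
defaulted_amount = dt.loc[dt["BAD"] == 1, "LOAN"].sum()  # amount loaned to defaulters
share_of_total = defaulted_amount / dt["LOAN"].sum() * 100
```

On the full dataset, the same three lines yield the 19.95% default rate and the 18.14% share quoted above.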
It is essential for the bank to know which key variables, patterns and behaviours are correlated with higher default rates, so that these elements can be taken into account during the loan application evaluation. Drawing a detailed picture of the various risk profiles associated with loan defaults, through data analyses and model building, will allow the bank to effectively reduce its risk exposure, by denying loans to future defaulters, while securing revenues, by granting loans to good payers. The proposed solution should support the bank in navigating this unstable environment, and find the fine line allowing for maximised profits and minimised risk exposure.
Before moving forward, it must be stated that the solution proposed is based on the analysis of a limited dataset. Therefore, the recommendations ought to be progressively rolled out, and regularly assessed to implement the needed corrections.
Our analysis identified two critical risk factors: debt-to-income ratio (DEBTINC) and the age of the oldest credit line (CLAGE). For DEBTINC, the highest absolute risk of default is found in clients with a DEBTINC of 34-35 and 40+, with targeted interventions potentially reducing defaults by over 82%, or by 16,538,100 EUR/USD. However, looking at the relative risk allows us to identify small sub-categories that contain a higher proportion of defaulting clients, which could represent quick wins for the bank. This is notably the case for clients with a DEBTINC of 0-6, who present the highest proportion of defaults within their group. Focusing on these small groups could yield quick financial benefits, representing a saving of a little over 3.1 million EUR/USD.
Similarly, for CLAGE, clients with a CLAGE of 100-150 months pose the highest absolute risk, while those in the 0-50 and 0-70 month ranges show elevated relative risk, presenting opportunities for targeted risk reduction.
See Annex for figures
The Tuned Gradient Booster model was selected for its strong overall recall (0.85), which is crucial for identifying potential loan defaulters. However, since recall on class 1 (default) is 0.72, the model is recommended as a support tool for human loan application reviews rather than a standalone decision-maker. Nonetheless, the model is effective enough to identify the key client characteristics correlated with loan default, and to provide essential insights to support and guide the human review.
The low computational power required to run the model also makes it suitable for scalable implementation at branch level. Allowing each branch to run the model on its own dataset (without re-tuning) will provide a tailored solution to address specific risk factors, and reduce the bank's exposure to the financial risk induced by loan default.
The solution is accompanied by a set of concrete recommendations, explained in more details in the Recommendation for Implementation Section below. These recommendations notably include strengthening the bank's data infrastructure, ensuring consistent loan application procedures across branches, implementing tailored monitoring mechanisms for high-risk sub-groups (alongside human oversight), as well as financial education programs. All these actions will be supported by cycles of staff training, and continuous model tuning and improvement. These actions aim to reduce default rates, optimise risk management, and maintain profitability (please refer to the Recommendation for Implementation Section for more details).
Several key next steps are identified (for more details, please refer to the Recommendations for Implementation section). Action points 2, 3 and 4 will be rolled out simultaneously, and in a cyclic manner:
Further efforts to deploy, once the new system is in place:
What are the key points that describe the final proposed solution design? What is the reason for the proposed solution design, and how would it affect the problem/business?
Chosen Model: the Tuned Gradient Booster offered the best performance on the Recall metric, which was identified as the most important performance metric, reaching 0.85. Precision and Accuracy also remained very high (over 0.9).
However, the Recall on class 1 is 0.72, which highlights that at this stage the model must be implemented as a support to human reviews of loan applications, rather than as their replacement.
Nonetheless, the model allows us to identify the key features that are conducive to loan default; these findings should orient the bank employees reviewing loan applications, and help them shed light on higher-risk profiles.
Additionally, the chosen model requires a limited computational power. This is essential when taking into account scalability concerns, as this would allow the model to be run directly at branch level, with local extracts of the dataset.
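The two recall figures quoted (0.85 overall vs 0.72 on class 1) are two views of the same metric. A hedged sketch of how both are computed with scikit-learn, on made-up labels rather than the report's actual test split:

```python
from sklearn.metrics import recall_score

# Illustrative labels; in the report, y_true/y_pred come from the test set
# and the tuned Gradient Boosting model
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

overall_recall = recall_score(y_true, y_pred, average="weighted")  # recall averaged over both classes
class1_recall = recall_score(y_true, y_pred, pos_label=1)          # recall on defaults (class 1) only
```

Monitoring both values avoids the trap of a high overall recall masking weaker detection of the minority (default) class.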
Through this analysis, we will highlight the absolute risk factors (highest default shares per feature), but also mention the relative risk factors (highest default rates within sub-groups of a given feature). The focus remains on the higher risks in absolute numbers, as these allow the bank to maximise its impact and strongly reduce its risk exposure. Nonetheless, we recommend that the bank considers the relative risks as quick-win measures to immediately reduce exposure, as these concern groups that are smaller in size but show a higher concentration of defaults. For each risk factor, associated levers will be recommended for the bank to activate for immediate risk mitigation.
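The absolute/relative distinction can be made concrete with a groupby over binned DEBTINC: absolute risk is each bin's share of all defaults, relative risk is the default rate within the bin. A sketch on toy data (the bin edges mirror the report; the sample values are invented):

```python
import numpy as np
import pandas as pd

# Toy DEBTINC/BAD sample; the report applies this to the full HMEQ frame
dt = pd.DataFrame({
    "DEBTINC": [5.0, 34.5, 34.8, 41.0, 30.0, 34.2, 3.0, 50.0, 36.0, 34.9],
    "BAD":     [1,   1,    0,    1,    0,    1,    1,   0,    0,    0],
})
bins = [0, 6, 34, 35, 40, np.inf]
labels = ["0-6", "6-34", "34-35", "35-40", "40+"]
dt["DEBTINC_bin"] = pd.cut(dt["DEBTINC"], bins, labels=labels, right=False)

grp = dt.groupby("DEBTINC_bin", observed=False)["BAD"]
absolute_risk = grp.sum() / dt["BAD"].sum() * 100  # share of ALL defaults falling in each bin
relative_risk = grp.mean() * 100                   # default rate WITHIN each bin
```

The same bin can rank high on one measure and low on the other, which is exactly the DEBTINC 34-35 vs 0-6 contrast discussed below.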
ABSOLUTE RISK : The highest-risk sub-group (largest share of all defaulters) is found in clients with DEBTINC = 34-35, as this group represents 64.204% of defaults. Addressing these would allow the bank to save 12,918,000 USD/EUR. Keeping a keen eye on the second risk group, clients with a 40+ DEBTINC ratio, would also allow the bank to address 17.992% of defaults, amounting to 3,620,100 USD/EUR.
All in all, closely monitoring and reducing risk exposure in these two ratio groups could reduce defaults by 82.196%.
RELATIVE RISK : relative and absolute risks identify different categories that should be addressed by the bank. The highest relative risk (highest proportion of defaulters within a sub-group) appears for clients with a DEBTINC between 0-6, as this category presents an overall default rate of 40.74%. Although it represents only 1.54% of all loans, activating the levers to decrease risk exposure on this sub-group would still allow the bank to save a little over 3.1 million USD/EUR, thereby representing an interesting quick win.
As a comparison, the highest risk category (in absolute terms), displays a much lower risk proportion in relative terms:
- **DEBTINC 34-35: risk ratio only at 5.66%**
The 2nd risk sub-group should be analysed further:
- **DEBTINC 40+: risk ratio at 19.02%** -- the risk proportion is quite high; however, we must keep in mind that this category includes all the **upper outliers (from Q3 up to the max value, 203.31)**. It is expected that these **outliers are not a homogenous, high-risk cohort**, but are rather composed of various sub-groups, each with varying risk levels. **Further analysis would be beneficial for the bank to best approach and monitor this risk factor.**
ABSOLUTE RISK : the highest-risk sub-group is found in clients with CLAGE=100-150, as this group includes 31.127% of defaults, totaling 5,909,100 USD/EUR for the bank. The second-highest risk sub-groups are CLAGE=50-100 with 18.468% of defaults (3,505,900 USD/EUR) and CLAGE=150-200 with 19.695% of defaults (3,738,900 USD/EUR).
However, when breaking down the size of the sub-groups to draw a more granular, detailed picture of the situation, interesting patterns appear, as the highest-risk sub-groups change:
In absolute terms, we still recommend that the bank keep a close eye on the general high-risk sub-groups identified (100-150, followed by 150-200 and 50-100), as well as on the 170-180 and 250+ ranges. Further analyses are also recommended on the latter sub-group, in order to better understand the risk structure of the outliers.
RELATIVE RISK : we identify a higher (relative) risk in the lower ranges:
Addressing the relative risk could represent a quick-win for the bank.
Key recommendations are detailed below in key and sub-recommendations ("actionables"), and include the specific stakeholders in charge of bringing these recommendations to life (Direct Responsibility) and of on-the-ground implementation (Implementation).
The missing data highlighted in the EDA reflects that data collection is not automated across the bank branches when clients apply for a loan. To ensure coherent and cohesive analysis, and to improve model performance, it is most urgent to:
--> once designed, the employees should be trained on the new procedures and tools
Direct Responsibility: Global CTO
Implementation: IT Leads at Regional and Branch-level, Client Relations Representatives at Branch-Level (client relations rep = client-facing employees on all loan-related matters)
Mitigate Risks for High-Risk Sub-Groups: Set up a specific monitoring system focused on the key features identified by the model (DEBTINC and CLAGE), and closely monitor clients displaying the highest-risk characteristics: DEBTINC=34-35 and 40+, and CLAGE=100-150, 170-180 and 250+ months.
Human Review for Risky Loan Applications: Entirely denying loans to clients falling in the high-risk sub-groups mentioned above is not an option: as a reminder, even within DEBTINC=34-35 (our highest-risk sub-group), defaults still represent less than 6% of all loans in this sub-group. The bank cannot afford to turn down all the clients with this DEBTINC ratio who would, in fact, repay their loans. However, it is recommended to integrate additional human-led verifications during the loan application preparation, and after the loan is granted (see previous point).
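The "support, not replace" stance above amounts to a routing rule: applications whose predicted default probability crosses a cut-off go to a human reviewer instead of being auto-denied. A minimal sketch; the 0.3 threshold is an assumption to be calibrated on branch data (e.g. against `model.predict_proba` outputs):

```python
# Hypothetical cut-off; to be calibrated per branch, not a value from the report
REVIEW_THRESHOLD = 0.3

def route_application(default_probability: float) -> str:
    """Return the handling path for one loan application."""
    if default_probability >= REVIEW_THRESHOLD:
        return "human_review"   # flagged as risky: a human takes the decision
    return "auto_process"       # low predicted risk: standard processing
```

This keeps the model advisory: it never denies a loan on its own, it only decides who looks at the file.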
Financial Education and Tailored Client Service: Accompanying risky client accounts is key to decrease risk exposure while securing revenues. This should be led through 2 parallel streams:
--> the Financial Education stream might require external consulting services to develop the material needed to fulfil this objective.
Direct Responsibility: IT Leads (for model output understanding), Branch Directors; optional: external consulting services
Implementation: Client Relations Representatives
Human Oversight: The Gradient Booster model predicts class 0 better than class 1 (default), for which recall reaches 0.72. Thus, while the model can help bank employees, human review is essential to validate and interpret the model's predictions, ensuring that loan decisions are robust and consider nuances the model may miss. The model should be used to shed light on risk factors, thereby facilitating the review of loan applications, both before loan approval and during the loan life cycle, serving as a continuous monitoring system that alerts humans when risk-prone patterns are identified (see point 1).
Model Tuning: As the data quality is expected to increase with the implementation of the data infrastructure recommendations (point 1.), it is essential to regularly tune the Gradient Booster Model at branch level. Keeping the models at local levels will ensure that the risk analysis is tailored to the specific client base of the branch, thereby allowing for adequate monitoring and mitigating measures. The key features are expected to evolve, and could even change.
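One way to read the branch-level setup is: hyperparameters are fixed once globally, and each branch only refits on its local extract. A sketch under that assumption (the parameter values and function name are illustrative, not from the report):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical shared hyperparameters, tuned once at global level; each branch
# refits the same specification on its own client extract, without re-tuning
SHARED_PARAMS = {"n_estimators": 200, "learning_rate": 0.1,
                 "max_depth": 3, "random_state": 1}

def fit_branch_model(X_branch, y_branch):
    """Refit the common model specification on one branch's local data."""
    model = GradientBoostingClassifier(**SHARED_PARAMS)
    return model.fit(X_branch, y_branch)

# Synthetic data stands in for one branch's extract of the HMEQ features
X, y = make_classification(n_samples=200, random_state=0)
branch_model = fit_branch_model(X, y)
```

Keeping `SHARED_PARAMS` centralised is what makes branch performance comparable, as argued in the next point.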
Alternative Models Maintenance: It is also important to maintain, at global level, 2-3 alternative models. As the bank gathers more data and the external environment evolves, this approach will help ensure that the model continues to perform optimally and adapts to new patterns of risk. It is however recommended that the same model, with the same hyperparameter tuning, be rolled out across the company to ensure that the performance of each branch can be compared and analysed on the same basis. Only the dataset run through the model should change (to only include the client base of the branch).
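Maintaining alternative models comes down to scoring every candidate on the same metric and splits. A hedged sketch using cross-validated recall on synthetic data (the candidate set and data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data stands in for the HMEQ features
X, y = make_classification(n_samples=300, weights=[0.8], random_state=0)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
# Same CV folds and same recall metric for every candidate, so the
# comparison mirrors the report's metric choice
recall_by_model = {name: cross_val_score(est, X, y, cv=5, scoring="recall").mean()
                   for name, est in candidates.items()}
```

Re-running this periodically on fresh global data is one way to detect when a fallback model overtakes the deployed one.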
Direct Responsibility: CTO, IT Global team
Implementation: IT regional Leads (in this model the IT branch Leads only run the local datasets, without touching the model tuning itself)
By implementing these recommendations, the bank will be better equipped to manage its loan portfolio, reduce default rates, and optimise its risk exposure, all while maintaining profitability and supporting sustainable growth.
The section below presents the risks/challenges (in bold) as well as the potential mitigating mechanisms to implement (signaled by the arrow "-->" sign).
--> Mitigation: Human Oversight (explained previously)
--> Mitigation: Conduct a thorough market analysis of competitors' procedures (which documents are required, delays, number of steps, etc.) to ensure that the recommended changes (Coherent Loan Access Procedures) do not put the bank at a competitive disadvantage
--> Mitigation: Continuously tune and maintain alternative models to cross analyse performance and ensure optimal results
--> Mitigation: Regularly explore integrating other potential features into the model, and regularly reassess the importance of existing ones.
--> Mitigation: Maintaining employee engagement and regularly controlling client status and experience
In-Depth Analysis of Outliers: Further analyze outlier groups within DEBTINC (e.g., 40+ range) and CLAGE (e.g., 170-180 and 250+ ranges) to better understand the risk structures and sub-groups within these brackets. This will enable the bank to refine its risk management strategies and tailor its loan approval process more effectively.
Absolute vs Relative Risks: Prioritize absolute risk reduction as it offers the greatest potential to minimize financial losses. However, relative risks should not be overlooked as they provide opportunities for quick wins that can improve the overall health of the loan portfolio.
Direct Responsibility: CTO, Client Relations Representative Global level
Implementation: IT Global team (outliers analysis), Branch directors (absolute vs relative risks mitigation, as such will depend on the branch's client base).
Loan default represents one of the key financial risks for a bank, and therefore has to be well understood in order to be minimised. Understanding this risk requires defining the key variable values that are correlated with higher default rates, and therefore higher risks. To do so, we will conduct the EDA to build some intuition/understanding of the context, before moving to the model building.
The objective is to develop a model that will help guide the bank's risk monitoring, assessment, and mitigation system by developing a better understanding of the risk profile of defaulting clients and, crucially, of the key features that characterise them. In order to avoid any discrimination imputed onto clients, we will develop a decision tree, a random forest, and a number of other models aimed at identifying the key features that, away from a demographics-based profile, characterise risk-prone behaviours and should therefore be monitored by the bank to mitigate and limit risk exposure.
The key questions to address through this analysis are the following:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn import metrics
# Linear Model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error
#Tree and Random Forest
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, accuracy_score
#Tuning and Boosting
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
!pip install xgboost
from xgboost import XGBClassifier
import catboost as cb
!pip install lightgbm
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")
#Column display and numerical simplification for readability purposes
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: "%.3f" % x)
data=pd.read_csv("/Users/elsabendine/Desktop/Capstone Project/hmeq.csv")
dt=data.copy()
print("Data Shape :", dt.shape)
Data Shape : (5960, 13)
dt.head().T
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| BAD | 1 | 1 | 1 | 1 | 0 |
| LOAN | 1100 | 1300 | 1500 | 1500 | 1700 |
| MORTDUE | 25860.000 | 70053.000 | 13500.000 | NaN | 97800.000 |
| VALUE | 39025.000 | 68400.000 | 16700.000 | NaN | 112000.000 |
| REASON | HomeImp | HomeImp | HomeImp | NaN | HomeImp |
| JOB | Other | Other | Other | NaN | Office |
| YOJ | 10.500 | 7.000 | 4.000 | NaN | 3.000 |
| DEROG | 0.000 | 0.000 | 0.000 | NaN | 0.000 |
| DELINQ | 0.000 | 2.000 | 0.000 | NaN | 0.000 |
| CLAGE | 94.367 | 121.833 | 149.467 | NaN | 93.333 |
| NINQ | 1.000 | 0.000 | 1.000 | NaN | 0.000 |
| CLNO | 9.000 | 14.000 | 10.000 | NaN | 14.000 |
| DEBTINC | NaN | NaN | NaN | NaN | NaN |
Data Overview
- Reading the dataset
- Understanding the shape of the dataset
- Checking the data types
- Checking for missing values
- Checking for duplicated values
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Notes: all variables except REASON and JOB are numerical, split between int and float types. Let's look at the unique values in each variable.
dt.nunique()
BAD           2
LOAN        540
MORTDUE    5053
VALUE      5381
REASON        2
JOB           6
YOJ          99
DEROG        11
DELINQ       14
CLAGE      5314
NINQ         16
CLNO         62
DEBTINC    4693
dtype: int64
dt.duplicated().sum()
0
dt.isna().sum()
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
missing=dt.isna().sum()
print("Percentage of missing values per feature/variable")
print(missing/len(dt)*100)
Percentage of missing values per feature/variable
BAD         0.000
LOAN        0.000
MORTDUE     8.691
VALUE       1.879
REASON      4.228
JOB         4.681
YOJ         8.641
DEROG      11.879
DELINQ      9.732
CLAGE       5.168
NINQ        8.557
CLNO        3.725
DEBTINC    21.258
dtype: float64
Observations :
We will proceed with the analysis before implementing the missing values treatment, so as not to influence the EDA.
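When the treatment does happen, one common option (an assumption on my part, not the report's final choice) is median imputation combined with a missing-value flag, so that "missingness" itself remains available to the models as a signal. A self-contained sketch:

```python
import numpy as np
import pandas as pd

def impute_with_flag(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Median-impute one column and record which rows were missing."""
    out = df.copy()
    out[col + "_missing"] = out[col].isna().astype(int)  # 1 where the value was NaN
    out[col] = out[col].fillna(out[col].median())        # fill with the column median
    return out

# Tiny demonstration frame; in practice this would be applied to dt["DEBTINC"] etc.
toy = impute_with_flag(pd.DataFrame({"DEBTINC": [30.0, np.nan, 40.0]}), "DEBTINC")
```

The flag column matters for a feature like DEBTINC, where over a fifth of values are missing and that absence may itself correlate with default.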
Note for the Evaluator: the guiding questions will be addressed in two steps:
num_col=["BAD", "LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO", "DEBTINC"]
cat_col=["REASON", "JOB"]
Summary Statistics
dt[num_col].describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BAD | 5960.000 | 0.199 | 0.400 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| LOAN | 5960.000 | 18607.970 | 11207.480 | 1100.000 | 11100.000 | 16300.000 | 23300.000 | 89900.000 |
| MORTDUE | 5442.000 | 73760.817 | 44457.609 | 2063.000 | 46276.000 | 65019.000 | 91488.000 | 399550.000 |
| VALUE | 5848.000 | 101776.049 | 57385.775 | 8000.000 | 66075.500 | 89235.500 | 119824.250 | 855909.000 |
| YOJ | 5445.000 | 8.922 | 7.574 | 0.000 | 3.000 | 7.000 | 13.000 | 41.000 |
| DEROG | 5252.000 | 0.255 | 0.846 | 0.000 | 0.000 | 0.000 | 0.000 | 10.000 |
| DELINQ | 5380.000 | 0.449 | 1.127 | 0.000 | 0.000 | 0.000 | 0.000 | 15.000 |
| CLAGE | 5652.000 | 179.766 | 85.810 | 0.000 | 115.117 | 173.467 | 231.562 | 1168.234 |
| NINQ | 5450.000 | 1.186 | 1.729 | 0.000 | 0.000 | 1.000 | 2.000 | 17.000 |
| CLNO | 5738.000 | 21.296 | 10.139 | 0.000 | 15.000 | 20.000 | 26.000 | 71.000 |
| DEBTINC | 4693.000 | 33.780 | 8.602 | 0.524 | 29.140 | 34.818 | 39.003 | 203.312 |
Visualisation
#Setting the titles
feature_descriptions={
"BAD":"Defaulting Clients (BAD)",
"LOAN": "Approved Loan Value (LOAN)",
"MORTDUE": "Existing Mortgage Value (MORTDUE)" ,
"VALUE":"Property Value (VALUE)",
"REASON":"Loan Request Reason (REASON)",
"JOB":"Type of Job (JOB)",
"YOJ":"Years at Job (YOJ)",
"DEROG": "Derogatory Reports (DEROG)",
"DELINQ":"Delinquent Credit Lines (DELINQ)",
"CLAGE":"Oldest Credit Line Age (CLAGE)",
"NINQ":"Recent Credit Inquiries (NINQ)",
"CLNO":"Existing Credit Lines (CLNO)",
"DEBTINC":"Debt-to-Income Ratio (DEBTINC)",
}
def title_graph(col):
description=feature_descriptions.get(col)
return f'{description} Distribution'
#Visualisation of Numerical Variables
def hist_box(col):
f, (ax_box, ax_hist)=plt.subplots(2, sharex=True, gridspec_kw={"height_ratios":(0.15,0.85)}, figsize=(7,5))
sns.set(style="darkgrid")
title=title_graph(col)
sns.boxplot(x=dt[col], ax=ax_box, showmeans=True, color="darkseagreen", meanprops={"marker":"s", "markerfacecolor":"black"}).set_title(title)
    sns.histplot(dt[col], ax=ax_hist, kde=True, color="darkblue")  # distplot is deprecated in recent seaborn
ax_hist.axvline(dt[col].mean(), color="black", linestyle="-.")
ax_hist.axvline(dt[col].median(), color="orange", linestyle="-")
plt.show()
fig, ax=plt.subplots()
sns.histplot(x=dt["BAD"], bins=2, color="darkslateblue").set_title("Defaulting Clients (BAD) distribution")
ax.set_xlim(0, 1)
ax.set_xticks(range(0, 2))
plt.show()
hist_box("LOAN")
hist_box("MORTDUE")
hist_box("VALUE")
hist_box("YOJ")
hist_box("DEROG")
hist_box("DELINQ")
hist_box("CLAGE")
hist_box("NINQ")
hist_box("CLNO")
hist_box("DEBTINC")
Initial Observations
BAD : the mean value is 0.199, which indicates that about 80% of the clients in the dataset have not defaulted on their loan (BAD=0). The std (0.4) reaches twice the value of the mean, which could reflect skewness in the data;
LOAN : the quartiles are quite uniformly distributed (Q1: 11.1k, Q2: 16.3k, Q3: 23.3k); however, we note important outliers in both directions: lower outliers (min value 1.1k, one tenth of Q1), and especially upper outliers (max value 89.9k, almost 4 times the Q3 value). The mean is quite close to Q2, hence a close-to-normal distribution is expected, although the std is quite high (11.2k).
MORTDUE : large distribution of the data. The differences between the quartiles are reasonably close (20-25k) and seem to indicate a close-to-normal distribution (Q1: 46.3k, Q2: 65k, Q3: 91.5k); however, we take note of strong outliers at both the min and max. The min value (2.1k) is about 1/20th of the Q1 value, while the max value (399.55k) is more than 4 times the Q3 value. We also note that the mean (73.76k) is above Q2, but still closer to the median than to Q3, and that the large std (44.5k) seems to reflect wide disparities in the dataset, driven by the upper outliers which strongly impact its structure. Comment: we take note that the min MORTDUE (2.1k) > min LOAN value (1.1k). This could be because LOAN only reflects the amount of money borrowed, without the interest, while MORTDUE = (LOAN * interest rate) - LOAN payments + penalties (if any). This structure could explain a min MORTDUE > min LOAN value.
VALUE : large distribution of the data. The distances between the quartiles are comparable (close to 30k), and we expect a rather normal distribution between Q1: 66k, Q2: 89.2k and Q3: 119.8k; however, we note strong outliers at both the min and max (min: 8k, max: 855.9k). The very strong upper outlier(s) heavily impact the structure of the data, as reflected in the std (57.4k), which is very close to the Q1 value; this reflects highly imbalanced data. The mean is also affected and surpasses the median, reaching 101.8k.
YOJ : right-skewed data with Q1: 3, Q2: 7, Q3: 13 (almost twice the Q2). We also note outliers: min: 0 and max: 41. These outliers expectedly push the mean (8.9 years) above the median, although the impact is limited, which seems to reflect a limited number of upper outliers, or the presence of even more lower outliers negating the effect of the former group. The high std (7.5 years, above the Q2 value) reflects a certain imbalance in the data.
DEROG : the vast majority of clients (and rows) are not concerned by the feature as Q1=Q2=Q3=0, however we note the presence of upper outliers with max=10. This seems to impact the structure of the data set as mean =0.255, and std = 0.846. Nonetheless these values are low which seems to confirm that the presence of DEROG!=0 is quite marginal. We will evaluate this further in the EDA.
DELINQ : Similar trend to the DEROG with Q1=Q2=Q3=0, however we note upper outlier with max value = 15. The latter impacts the mean value (0.45) and std (1.13), although reasonably so. This could reflect the limited presence of DELINQ !=0
CLAGE : right-skewed variable with Q1: 115.12, Q2: 173.47, Q3: 231.56, the difference between Q2 and Q3 being double that between Q1 and Q2. In addition, we note upper outlier(s) with a max value of 1168.23. The mean (179.77) is close to the median (Q2). The std is important (85.8, where Q1 is 115.12) but remains reasonable, which could indicate a limited number of upper outliers, or the presence of even more lower outliers (between Q1 and the min value of 0). This variable being measured in months, we note that the max value corresponds to about 97 years!
NINQ : normally distributed data between Q1 and Q3 (Q1: 0, Q2: 1, Q3: 2), but we take note of important upper outlier(s) with max value = 17. The latter impacts the data structure with mean = 1.18 (slightly above median), and std at 1.7 (closer to Q3 than Q2). Although notable, the limited impact on the data could reflect a limited number of such upper outliers. We will evaluate this further during the EDA:
CLNO : normally distributed data from Q1 to Q3 (Q1: 15, Q2: 20, Q3: 26) but we take note of very important upper outliers with max value: 71 (open credit lines!!). We note that the mean value (21.3) is very close to the median value, while the std reaches 10.14, reflecting a quite balanced dataset, other than the upper outliers.
DEBTINC : before drawing any observations, we must remember that over one fifth of the data is missing (21.3% NaN). Among the existing data, we note a very dense distribution (Q1: 29.14, Q2: 34.82, Q3: 39.00), but very strong outliers with a max value of 203.31 and a min value of 0.524. The mean (33.78) is very close to the median, and the low std (8.6) seems to reflect a rather well-distributed dataset despite the high outliers. The latter could be few in number; this will be further analysed in the EDA.
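The "upper outliers" repeatedly noted above can be quantified with the usual IQR rule (values above Q3 + 1.5*IQR, the same fence the boxplots draw). A small helper, shown here on a toy series rather than the real columns:

```python
import pandas as pd

def count_upper_outliers(s: pd.Series) -> int:
    """Count values above the Tukey upper fence Q3 + 1.5*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((s > upper_fence).sum())

# Toy series; in practice: count_upper_outliers(dt["DEBTINC"].dropna()), etc.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
```

Applying this per column would turn the qualitative remarks ("a limited number of upper outliers") into exact counts.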
Summary Statistics
#Summary Stats of Categorical Data
for i in cat_col:
print((dt[i].value_counts(normalize=True)*100).to_string(name=False))
print("-."*20)
REASON DebtCon 68.816 HomeImp 31.184 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. JOB Other 42.035 ProfExe 22.461 Office 16.687 Mgr 13.501 Self 3.397 Sales 1.919 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
Visualisation
def cat_graph(col, color):
title=title_graph(col)
labels = dt[col].value_counts().reset_index()
labels.columns = [col, "Count"]
fig=px.bar(labels,x=col,y="Count", template="plotly_white", title=title, text="Count")
fig.update_traces(marker_color=color, opacity=1, textposition="outside")
fig.update_layout(autosize=False,width=400, height=500)
fig.show()
cat_graph("REASON", "darkslateblue")
cat_graph("JOB", "darkslateblue")
Initial Observation:
REASON: We note 68.82% debt consolidation against 31.18% home improvement loans in the dataset (for the available data). This could be a source of concern if debt consolidation is used by clients facing difficulties with their loan repayments, and not also by clients "buying out" their loans from their banks when policy interest rates are lowered by the national bank (as is possible in my country, for instance). In order to clarify whether this variable reflects a degree of risk, we will evaluate it in the EDA.
JOB: The client dataset is quite diverse in its professional background. The largest identified group is "ProfExe", followed by "Office" and "Mgr" (from 22.46% down to 13.5%), and a minority of "Self" employed and "Sales" workers (3.4% and 1.9%). However, we note that 42% of clients are recorded as "Other", which could reflect a genuinely diverse client base, or a lack of suitable categories in the survey/client data collection tool.
Correlation Analysis
plt.figure(figsize=(7,4))
corr=data[num_col].corr()
mask=np.triu(corr)  # renamed from "map" to avoid shadowing the builtin
sns.heatmap(corr, annot=True, fmt=".1f", cmap="crest", mask=mask)
plt.show()
Observation: Correlations are not very strong, neither positive nor negative (range from -0.2 to 0.4), with one exception (below).
Target variable: BAD:
Second variable of interest: DEBTINC: keep in mind that we have a lot of missing data. Nonetheless, on the accessible data we note positive correlations (0.2) with:
Stronger positive correlations (0.3 and above)
Weaker positive correlations(0.2 and lower)
Numerical Variables - Correlation visualisation. To explore these correlations further, we will bin the data and visualise the risk level (BAD proportion in each variable sub-group).
#Copy the dataset to create bins and facilitate the visualisation
dt_bins=dt.copy()
#Bining LOAN
bins = [0, 5000, 10000, 15000, 20000, 25000, np.inf]
names = ['<5000', '5000-10000', '10000-15000', '15000-20000','20000-25000', '25000+']
dt_bins['LOAN'] = pd.cut(dt_bins['LOAN'], bins, labels=names, right=False)
dt_bins['LOAN'] = dt_bins['LOAN'].cat.reorder_categories(names, ordered=True)
#Bining MORTDUE
bins = [0, 20000, 40000, 60000, 80000, 100000, np.inf]
names = ['<20000', '20000-40000', '40000-60000', '60000-80000','80000-100000', '100000+']
dt_bins['MORTDUE'] = pd.cut(dt_bins['MORTDUE'], bins, labels=names, right=False)
dt_bins['MORTDUE'] = dt_bins['MORTDUE'].cat.reorder_categories(names, ordered=True)
#Bining VALUE
bins = [0, 50000, 100000, 150000, 200000, 250000, np.inf]
names = ['<50000', '50000-100000', '100000-150000', '150000-200000','200000-250000', '250000+']
dt_bins['VALUE'] = pd.cut(dt_bins['VALUE'], bins, labels=names, right=False)
dt_bins['VALUE'] = dt_bins['VALUE'].cat.reorder_categories(names, ordered=True)
#Bining YOJ
bins = [0, 3, 6, 9, 12, 15, np.inf]
names = ['<3', '3-6', '6-9', '9-12', '12-15', '15+']
dt_bins['YOJ'] = pd.cut(dt_bins['YOJ'], bins, labels=names, right=False)
dt_bins['YOJ'] = dt_bins['YOJ'].cat.reorder_categories(names, ordered=True)
#Bining DEROG
bins = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.inf]
names = ['<1', '1-2', '2-3', '3-4','4-5','5-6', '6-7', '7-8', '8-9', '9-10', '10+']
dt_bins['DEROG'] = pd.cut(dt_bins['DEROG'], bins, labels=names, right=False)
dt_bins['DEROG'] = dt_bins['DEROG'].cat.reorder_categories(names, ordered=True)
#Bining DELINQ
bins = [0, 2, 4, 6, 8, 10, 12, 14, np.inf]
names = ['<2', '2-4', '4-6', '6-8', '8-10', '10-12', '12-14', '14+']
dt_bins['DELINQ'] = pd.cut(dt_bins['DELINQ'], bins, labels=names, right=False)
dt_bins['DELINQ'] = dt_bins['DELINQ'].cat.reorder_categories(names, ordered=True)
#Bining CLAGE
bins = [0, 50, 100, 150, 200, 250, np.inf]
names = ['<50', '50-100', '100-150', '150-200','200-250', '250+']
dt_bins['CLAGE'] = pd.cut(dt_bins['CLAGE'], bins, labels=names, right=False)
dt_bins['CLAGE'] = dt_bins['CLAGE'].cat.reorder_categories(names, ordered=True)
#Bining NINQ
bins = [0, 0.5, 1, 1.5, 2, 2.5,3, np.inf]
names = ['<0.5', '0.5-1', '1-1.5', '1.5-2','2-2.5', '2.5-3','3+']
dt_bins['NINQ'] = pd.cut(dt_bins['NINQ'], bins, labels=names, right=False)
dt_bins['NINQ'] = dt_bins['NINQ'].cat.reorder_categories(names, ordered=True)
#Bining CLNO
bins = [0, 5, 10, 15, 20, 25,30, np.inf]
names = ['<5', '5-10', '10-15', '15-20','20-25', '25-30','30+']
dt_bins['CLNO'] = pd.cut(dt_bins['CLNO'], bins, labels=names, right=False)
dt_bins['CLNO'] = dt_bins['CLNO'].cat.reorder_categories(names, ordered=True)
#Bining DEBTINC
bins = [0, 10, 20, 30, 40, np.inf]
names = ['<10', '10-20', '20-30', '30-40', '40+']
dt_bins['DEBTINC'] = pd.cut(dt_bins['DEBTINC'], bins, labels=names, right=False)
dt_bins['DEBTINC'] = dt_bins['DEBTINC'].cat.reorder_categories(names, ordered=True)
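The nine binning cells above repeat the same pattern; if useful, they could be factored into a small helper using the same `pd.cut` conventions. This is only a sketch: `bin_column` and the `demo` frame are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def bin_column(df, col, edges, labels):
    """Bin df[col] into ordered categories using right-open intervals, as above."""
    df[col] = pd.cut(df[col], bins=edges, labels=labels, right=False)
    df[col] = df[col].cat.reorder_categories(labels, ordered=True)
    return df

# Toy usage with the LOAN edges from the notebook:
demo = pd.DataFrame({"LOAN": [1100, 7200, 16300, 89900]})
bin_column(demo, "LOAN",
           [0, 5000, 10000, 15000, 20000, 25000, np.inf],
           ['<5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000', '25000+'])
print(demo["LOAN"].tolist())  # ['<5000', '5000-10000', '15000-20000', '25000+']
```

Each binning cell then reduces to a single `bin_column(dt_bins, ...)` call.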
#Graph function
def biv_hist(col):
    title=title_graph(col)
    color_map={0:"darkblue", 1:"darkseagreen"}
    fig=px.histogram(dt_bins, x=col, color="BAD", barmode="stack", nbins=10, opacity=0.6, color_discrete_map=color_map, text_auto=True, title=title)
    fig.update_traces(textposition="outside")
    fig.update_layout(height=600, width=600, bargap=0.05, title=dict(text=title, x=0.5, xanchor="center", font=dict(size=24)))
    fig.show()
#Table function
from tabulate import tabulate
def key_feat(col, head):
    breakdown=(dt_bins.groupby(col)["BAD"].value_counts(normalize=True)*100).unstack().fillna(0)
    breakdown=breakdown.sort_index()
    key_feat_subs=breakdown[1].sort_values(ascending=False).head(head)
    key_feat_df=key_feat_subs.reset_index()
    key_feat_df.columns=[col, '% of defaulters (BAD=1)']
    table=tabulate(key_feat_df, headers='keys', tablefmt='fancy_grid', showindex=False)
    print(table)
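For reference, the intended per-bin default percentage reduces to a groupby-mean, since BAD is a 0/1 column. Below is a minimal sketch on hypothetical toy data (`LOAN_BIN` is an illustrative column name, not from the dataset):

```python
import pandas as pd

# Toy stand-in for dt_bins: two bins, with 2/2 and 1/3 defaulters respectively.
toy = pd.DataFrame({
    "LOAN_BIN": ["<5000", "<5000", "5000-10000", "5000-10000", "5000-10000"],
    "BAD":      [1, 1, 0, 1, 0],
})
# The mean of a 0/1 column is the default rate; *100 gives a percentage.
rate = (toy.groupby("LOAN_BIN")["BAD"].mean() * 100).sort_values(ascending=False)
print(rate.to_string())
```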
biv_hist("LOAN")
print("Percentage of Default per category")
key_feat("LOAN", 10)
Percentage of Default per category

| LOAN | % of defaulters (BAD=1) |
|---|---|
| 1100 | 100 |
| 47800 | 100 |
| 77200 | 100 |
| 45900 | 100 |
| 53400 | 100 |
| 1300 | 100 |
| 36900 | 100 |
| 37300 | 100 |
| 38800 | 100 |
| 72300 | 100 |
biv_hist("MORTDUE")
print("Percentage of Default per category")
key_feat("MORTDUE", 10)
Percentage of Default per category

| MORTDUE | % of defaulters (BAD=1) |
|---|---|
| 2063 | 100 |
| 58500 | 100 |
| 58307 | 100 |
| 58204 | 100 |
| 58104 | 100 |
| 58100 | 100 |
| 58016 | 100 |
| 57993 | 100 |
| 57988 | 100 |
| 57939 | 100 |
biv_hist("VALUE")
print("Percentage of Default per category")
key_feat("VALUE", 10)
Percentage of Default per category

| VALUE | % of defaulters (BAD=1) |
|---|---|
| 855909 | 100 |
| 135822 | 100 |
| 66700 | 100 |
| 66660 | 100 |
| 66603 | 100 |
| 135317 | 100 |
| 135447 | 100 |
| 66403 | 100 |
| 135917 | 100 |
| 137390 | 100 |
biv_hist("YOJ")
print("Percentage of Default per category")
key_feat("YOJ", 10)
Percentage of Default per category

| YOJ | % of defaulters (BAD=1) |
|---|---|
| 41 | 100 |
| 9.8 | 100 |
| 9.3 | 100 |
| 2.2 | 100 |
| 9.2 | 100 |
| 2.6 | 100 |
| 2.9 | 100 |
| 8.9 | 100 |
| 3.2 | 100 |
| 8.3 | 100 |
biv_hist("DEROG")
print("Percentage of Default per category")
key_feat("DEROG", 10)
Percentage of Default per category

| DEROG | % of defaulters (BAD=1) |
|---|---|
| 7 | 100 |
| 8 | 100 |
| 9 | 100 |
| 10 | 100 |
| 4 | 78.2609 |
| 3 | 74.1379 |
| 6 | 66.6667 |
| 2 | 51.25 |
| 5 | 46.6667 |
| 1 | 38.8506 |
Note for the Corrector: I had an issue with the code using the function - I raised a ticket, but in the meantime I wrote separate plotting code for DELINQ and CLAGE.
title=title_graph("DELINQ")
color_map={0:"darkblue", 1:"darkseagreen"}
fig=px.histogram(dt,x="DELINQ", color="BAD", barmode="stack", nbins=10, opacity=0.6, color_discrete_map=color_map, text_auto=True, title=title)
fig.update_traces(textposition='outside')
fig.update_layout(height=600, width=600, bargap=0.05, title=dict(text=title, x=0.5, xanchor="center", font=dict(size=24)))
fig.show()
print("Percentage of Default per category")
key_feat("DELINQ", 10)
Percentage of Default per category

| DELINQ | % of defaulters (BAD=1) |
|---|---|
| 6 | 100 |
| 7 | 100 |
| 8 | 100 |
| 10 | 100 |
| 11 | 100 |
| 12 | 100 |
| 13 | 100 |
| 15 | 100 |
| 5 | 81.5789 |
| 4 | 58.9744 |
title=title_graph("CLAGE")
color_map={0:"darkblue", 1:"darkseagreen"}
fig=px.histogram(dt,x="CLAGE", color="BAD", barmode="stack", nbins=10, opacity=0.6, color_discrete_map=color_map, text_auto=True, title=title)
fig.update_traces(textposition='outside')
fig.update_layout(height=600, width=600, bargap=0.05, title=dict(text=title, x=0.5, xanchor="center", font=dict(size=24)))
fig.show()
print("Percentage of Default per category")
key_feat("CLAGE", 10)
Percentage of Default per category

| CLAGE | % of defaulters (BAD=1) |
|---|---|
| 0 | 100 |
| 130.8 | 100 |
| 130.36 | 100 |
| 130.342 | 100 |
| 130.3 | 100 |
| 130.033 | 100 |
| 129.968 | 100 |
| 129.934 | 100 |
| 129.9 | 100 |
| 129.833 | 100 |
biv_hist("NINQ")
print("Percentage of Default per category")
key_feat("NINQ", 10)
Percentage of Default per category

| NINQ | % of defaulters (BAD=1) |
|---|---|
| 12 | 100 |
| 13 | 100 |
| 14 | 100 |
| 17 | 100 |
| 9 | 54.5455 |
| 6 | 51.7857 |
| 8 | 50 |
| 5 | 48 |
| 4 | 39.1026 |
| 7 | 34.0909 |
biv_hist("CLNO")
print("Percentage of Default per category")
key_feat("CLNO", 10)
Percentage of Default per category

| CLNO | % of defaulters (BAD=1) |
|---|---|
| 71 | 100 |
| 2 | 100 |
| 65 | 100 |
| 64 | 100 |
| 63 | 100 |
| 58 | 100 |
| 57 | 100 |
| 53 | 100 |
| 1 | 83.3333 |
| 45 | 50 |
biv_hist("DEBTINC")
print("Percentage of Default per category")
key_feat("DEBTINC", 10)
Percentage of Default per category

| DEBTINC | % of defaulters (BAD=1) |
|---|---|
| 0.524499 | 100 |
| 35.968 | 100 |
| 35.7293 | 100 |
| 35.7306 | 100 |
| 35.7336 | 100 |
| 35.7598 | 100 |
| 35.7641 | 100 |
| 35.7749 | 100 |
| 35.7884 | 100 |
| 35.8867 | 100 |
biv_hist("REASON")
print("Percentage of Default per category")
key_feat("REASON", 10)
Percentage of Default per category

| REASON | % of defaulters (BAD=1) |
|---|---|
| HomeImp | 22.2472 |
| DebtCon | 18.9664 |
To improve readability of the pairplot, we have selected only the numerical variables, and placed on the x axis the features that seem to show a stronger correlation with default, as identified in the correlation heatmap (DEROG and DELINQ). We also add DEBTINC, as it is identified in the problem statement as a feature of specific interest for the banks.
sns.pairplot(dt, hue="BAD", x_vars=['DEROG','DELINQ', 'DEBTINC'], y_vars=["LOAN", "MORTDUE", 'VALUE', 'YOJ', 'CLAGE', 'NINQ', 'CLNO'],height=2)
Observation:
Numerical variables - default risk level
Categorical variables - default risk level
Very interestingly, through the percentage analysis, we note that:
During our EDA, we noted that a lot of features have outliers, though with varying densities (cf. boxplots above). In order to limit bias in the model, we will treat the outliers following this logic:
As demonstrated in the EDA, for a number of variables some sub-groups (bins) represent a higher relative risk than others. As these sub-groups represent a minority, they are at higher risk of lying outside Q1-Q3, and treating all outliers at this stage would erase these important details before model building.
def outliers(col):
    Q1 = dt[col].quantile(0.25)
    Q3 = dt[col].quantile(0.75)
    IQR = Q3 - Q1
    min_value = Q1 - 1.5*IQR
    max_value = Q3 + 1.5*IQR
    dt[col] = dt[col].clip(lower=min_value, upper=max_value)
    return dt
outliers("YOJ")
outliers("NINQ")
outliers("CLNO")
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000 | 39025.000 | 0 | 0 | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 70053.000 | 68400.000 | 0 | 0 | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 13500.000 | 16700.000 | 0 | 0 | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 65019.000 | 89235.500 | 1 | 0 | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 97800.000 | 112000.000 | 0 | 1 | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.000 | 90185.000 | 1 | 0 | 16.000 | 0.000 | 0.000 | 221.809 | 0.000 | 16.000 | 36.112 |
| 5956 | 0 | 89000 | 54576.000 | 92937.000 | 1 | 0 | 16.000 | 0.000 | 0.000 | 208.692 | 0.000 | 15.000 | 35.860 |
| 5957 | 0 | 89200 | 54045.000 | 92924.000 | 1 | 0 | 15.000 | 0.000 | 0.000 | 212.280 | 0.000 | 15.000 | 35.557 |
| 5958 | 0 | 89800 | 50370.000 | 91861.000 | 1 | 0 | 14.000 | 0.000 | 0.000 | 213.893 | 0.000 | 16.000 | 34.341 |
| 5959 | 0 | 89900 | 48811.000 | 88934.000 | 1 | 0 | 15.000 | 0.000 | 0.000 | 219.601 | 0.000 | 16.000 | 34.572 |
5960 rows × 13 columns
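The same IQR clipping can be applied to several columns in one pass. The sketch below is illustrative only: `clip_iqr` and the `demo` frame are hypothetical names, not from the original notebook.

```python
import pandas as pd

def clip_iqr(df, cols, k=1.5):
    """Clip each listed column to [Q1 - k*IQR, Q3 + k*IQR]."""
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return df

# Toy usage mirroring CLNO's quartiles (Q1=15, Q3=26):
demo = pd.DataFrame({"CLNO": [1, 15, 20, 26, 71]})
clip_iqr(demo, ["CLNO"])
print(demo["CLNO"].max())  # 42.5, i.e. 26 + 1.5*(26-15)
```

With this helper, the three `outliers(...)` calls above collapse into `clip_iqr(dt, ["YOJ", "NINQ", "CLNO"])`.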
As previously reflected in the EDA, most variables present missing values. As the proportion of NaN remains reasonable (from 1% to 21% of total rows), we will replace the NaN with the median value for numerical data, and with the mode for categorical data.
def fill_na_num(col):
    median = dt[col].median()
    dt[col] = dt[col].fillna(median)  # assignment avoids the deprecated inplace fillna on a column
    return dt
#Filling missing numerical values with the median
fill_na_num("MORTDUE")
fill_na_num("VALUE")
fill_na_num("YOJ")
fill_na_num("DEROG")
fill_na_num("DELINQ")
fill_na_num("CLAGE")
fill_na_num("NINQ")
fill_na_num("CLNO")
fill_na_num("DEBTINC")
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000 | 39025.000 | HomeImp | Other | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 70053.000 | 68400.000 | HomeImp | Other | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 13500.000 | 16700.000 | HomeImp | Other | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 65019.000 | 89235.500 | NaN | NaN | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 97800.000 | 112000.000 | HomeImp | Office | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.000 | 90185.000 | DebtCon | Other | 16.000 | 0.000 | 0.000 | 221.809 | 0.000 | 16.000 | 36.112 |
| 5956 | 0 | 89000 | 54576.000 | 92937.000 | DebtCon | Other | 16.000 | 0.000 | 0.000 | 208.692 | 0.000 | 15.000 | 35.860 |
| 5957 | 0 | 89200 | 54045.000 | 92924.000 | DebtCon | Other | 15.000 | 0.000 | 0.000 | 212.280 | 0.000 | 15.000 | 35.557 |
| 5958 | 0 | 89800 | 50370.000 | 91861.000 | DebtCon | Other | 14.000 | 0.000 | 0.000 | 213.893 | 0.000 | 16.000 | 34.341 |
| 5959 | 0 | 89900 | 48811.000 | 88934.000 | DebtCon | Other | 15.000 | 0.000 | 0.000 | 219.601 | 0.000 | 16.000 | 34.572 |
5960 rows × 13 columns
#Filling categorical missing values with mode
def fill_na_cat(col):
    mode = dt[col].mode()[0]
    dt[col] = dt[col].fillna(mode)  # assignment avoids the deprecated inplace fillna on a column
    return dt
fill_na_cat("REASON")
fill_na_cat("JOB")
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000 | 39025.000 | HomeImp | Other | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 70053.000 | 68400.000 | HomeImp | Other | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 13500.000 | 16700.000 | HomeImp | Other | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 65019.000 | 89235.500 | DebtCon | Other | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 97800.000 | 112000.000 | HomeImp | Office | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.000 | 90185.000 | DebtCon | Other | 16.000 | 0.000 | 0.000 | 221.809 | 0.000 | 16.000 | 36.112 |
| 5956 | 0 | 89000 | 54576.000 | 92937.000 | DebtCon | Other | 16.000 | 0.000 | 0.000 | 208.692 | 0.000 | 15.000 | 35.860 |
| 5957 | 0 | 89200 | 54045.000 | 92924.000 | DebtCon | Other | 15.000 | 0.000 | 0.000 | 212.280 | 0.000 | 15.000 | 35.557 |
| 5958 | 0 | 89800 | 50370.000 | 91861.000 | DebtCon | Other | 14.000 | 0.000 | 0.000 | 213.893 | 0.000 | 16.000 | 34.341 |
| 5959 | 0 | 89900 | 48811.000 | 88934.000 | DebtCon | Other | 15.000 | 0.000 | 0.000 | 219.601 | 0.000 | 16.000 | 34.572 |
5960 rows × 13 columns
dt.isna().sum()
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
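The two imputation helpers can also be expressed as a single `fillna` call with a per-column mapping. A minimal sketch on toy data (column names mirror the dataset; the `toy` frame is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DEBTINC": [30.0, np.nan, 40.0],
    "REASON": ["HomeImp", None, "HomeImp"],
})
fill_values = {
    "DEBTINC": toy["DEBTINC"].median(),  # numerical -> median
    "REASON": toy["REASON"].mode()[0],   # categorical -> mode
}
toy = toy.fillna(fill_values)
print(toy.isna().sum().sum())  # 0
```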
dt.REASON=pd.factorize(dt.REASON)[0]
dt.JOB=pd.factorize(dt.JOB)[0]
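Note that `pd.factorize` assigns arbitrary integer codes, which implicitly imposes an ordering on REASON and JOB; tree models tolerate this, but for a linear model such as the logistic regression below, one-hot encoding is usually safer. A toy sketch of the alternative:

```python
import pandas as pd

toy = pd.DataFrame({"REASON": ["HomeImp", "DebtCon", "HomeImp"]})
# drop_first=True drops the first (alphabetical) category to avoid collinearity.
encoded = pd.get_dummies(toy, columns=["REASON"], drop_first=True)
print(list(encoded.columns))  # ['REASON_HomeImp']
```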
dt.sample(10)
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1113 | 0 | 9900 | 8673.000 | 37244.000 | 1 | 0 | 4.000 | 0.000 | 0.000 | 256.681 | 0.000 | 12.000 | 24.496 |
| 432 | 0 | 6700 | 69773.000 | 83014.000 | 0 | 0 | 0.000 | 1.000 | 0.000 | 263.173 | 0.000 | 5.000 | 37.707 |
| 2614 | 0 | 15000 | 46867.000 | 67529.000 | 1 | 0 | 7.000 | 0.000 | 0.000 | 133.788 | 3.000 | 48.000 | 31.486 |
| 3397 | 0 | 18000 | 45000.000 | 68500.000 | 1 | 4 | 10.000 | 0.000 | 0.000 | 190.800 | 1.000 | 42.000 | 34.818 |
| 4321 | 0 | 22500 | 65019.000 | 143254.000 | 1 | 0 | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 5611 | 0 | 37900 | 150044.000 | 202894.000 | 0 | 4 | 5.000 | 0.000 | 0.000 | 156.034 | 0.000 | 30.000 | 36.311 |
| 5507 | 0 | 33700 | 147522.000 | 198737.000 | 1 | 4 | 20.000 | 0.000 | 0.000 | 188.149 | 1.000 | 21.000 | 37.309 |
| 2675 | 0 | 15200 | 71233.000 | 84439.000 | 1 | 0 | 1.000 | 0.000 | 0.000 | 242.983 | 2.000 | 26.000 | 32.210 |
| 837 | 0 | 8600 | 124810.000 | 176581.000 | 1 | 0 | 8.000 | 0.000 | 0.000 | 213.303 | 1.000 | 19.000 | 32.876 |
| 4580 | 0 | 23900 | 87884.000 | 116559.000 | 1 | 3 | 23.000 | 0.000 | 0.000 | 314.394 | 0.000 | 42.000 | 37.988 |
dt_1=dt.copy() # make a copy to try different scaling methods
scaler=StandardScaler()
scaled_features = scaler.fit_transform(dt_1.drop("BAD", axis=1))
scaled_data = pd.DataFrame(scaled_features, columns=dt_1.drop("BAD", axis=1).columns)
scaled_data["BAD"] = dt_1["BAD"]
#Train-Test Split
x_1=scaled_data.drop("BAD",axis=1)
y_1=scaled_data["BAD"]
x1_train, x1_test,y1_train, y1_test=train_test_split(x_1,y_1,test_size=0.2, random_state=1)
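One caveat on this split: unlike the later tree/forest split, it does not stratify on BAD, so the roughly 20% minority share can drift between train and test. A stratified variant keeps the ratio exact; a toy sketch (the `y_toy` labels are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 80/20 imbalance as the dataset.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)
_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                    random_state=1, stratify=y_toy)
print(y_tr.mean(), y_te.mean())  # 0.2 0.2 -- identical class share in both folds
```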
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1-r2)*(n-1)/(n-k-1))

def mape_score(targets, predictions):
    return np.abs((targets - predictions)/targets)*100
def model_performance_regression(model, predictors, target):
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)
    adjr2 = adj_r2_score(predictors, target, pred)
    rmse = np.sqrt(mean_squared_error(target, pred))
    mae = mean_absolute_error(target, pred)
    mape = mape_score(target, pred)
    df_perf = pd.DataFrame({"RMSE": rmse, "MAE": mae, "R-Squared": r2, "Adj. R-Squared": adjr2, "MAPE": mape,}, index=[0],)
    return df_perf
log=LogisticRegression()
log.fit(x1_train, y1_train)
LogisticRegression()
y_train_pred_log=log.predict(x1_train)
mape_score(y1_train, y_train_pred_log)
5292 0.000
1162 NaN
2355 NaN
3620 NaN
1247 NaN
...
905 0.000
5192 NaN
3980 NaN
235 NaN
5157 NaN
Name: BAD, Length: 4768, dtype: float64
y_test_pred_log=log.predict(x1_test)
mape_score(y1_test, y_test_pred_log)
4146 0.000
4103 NaN
5439 NaN
997 NaN
5908 NaN
...
4171 NaN
100 0.000
5164 NaN
5208 NaN
4538 NaN
Name: BAD, Length: 1192, dtype: float64
log_test= model_performance_regression(log, x1_test, y1_test)
log_test
| RMSE | MAE | R-Squared | Adj. R-Squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.890 | 0.792 | -3.806 | -3.855 | NaN |
Observations: the model performs remarkably poorly as per the indicators. Note also that MAPE divides by the true label, so every row with BAD=0 produces a NaN (hence the output above); regression metrics are fundamentally ill-suited to a binary target. We suspect that quite a lot of work would be required to improve performance here (mobilising PCA and KNN, amongst others).
However, we are convinced that Decision Tree, Random Forest and other tree-based models should provide good results with limited computational needs, while ensuring a degree of readability for the user.
We would suggest moving on to the Decision Tree model.
Encoded Data Train-Test Split
x=dt.drop(["BAD"], axis=1)
y=dt["BAD"]
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.3, random_state=1, stratify=y)
print("Shape of the training data: ", x_train.shape)
print("Shape of the test data: ", x_test.shape)
Shape of the training data:  (4172, 12)
Shape of the test data:  (1788, 12)
Non-linear Model Performance Metrics
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(4,4))
    sns.heatmap(cm, annot=True, fmt=".2f", xticklabels=["Non-Default", "Default"], yticklabels=["Non-Default", "Default"])
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.show()
def model_performance_classification(model, predictors, target):
    pred = model.predict(predictors)
    recall = recall_score(target, pred, average="macro")
    precision = precision_score(target, pred, average="macro")
    acc = accuracy_score(target, pred)
    #Create Dataframe Metrics
    df_perf = pd.DataFrame({"Precision": precision, "Recall": recall, "Accuracy": acc,}, index=[0],)
    return df_perf
The problem we are aiming to solve, namely identifying whether clients are likely to default on their loan, is inherently a classification problem. We will therefore use a Decision Tree Classifier, along with a Random Forest Classifier.
Dataset imbalance: BAD=1 represents only 19.94% of our overall dataset. To avoid any bias, we will set the class_weight hyperparameter from the beginning (rounding 19.94% to 20%).
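As an alternative to hand-set weights, scikit-learn can derive them with the "balanced" heuristic, n_samples / (n_classes * class_count), which up-weights the minority class. A sketch on a toy 80/20 label vector (illustrative, not the actual training labels):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 80 + [1] * 20)
# "balanced" gives 100/(2*80)=0.625 for class 0 and 100/(2*20)=2.5 for class 1.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))  # {0: 0.625, 1: 2.5}
```

The same effect is available directly via `class_weight="balanced"` in the classifier constructor.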
tree=DecisionTreeClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
tree.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.8, 1: 0.2}, random_state=1)
y_train_pred_tree=tree.predict(x_train)
metrics_score(y_train, y_train_pred_tree)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Pure overfitting: the model scores 100% across the board on the training data. Let's move to the test data to confirm.
y_test_pred_tree=tree.predict(x_test)
metrics_score(y_test, y_test_pred_tree)
precision recall f1-score support
0 0.91 0.93 0.92 1431
1 0.70 0.64 0.67 357
accuracy 0.87 1788
macro avg 0.81 0.79 0.80 1788
weighted avg 0.87 0.87 0.87 1788
dtree_test= model_performance_classification(tree, x_test, y_test)
dtree_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.806 | 0.788 | 0.874 |
Observations: the model is vastly overfitting, as we note that:
Decision Tree - Visualisation
from sklearn.tree import plot_tree
features=list(x.columns)
plt.figure(figsize=(35,7))
plot_tree(tree, max_depth=5, feature_names=features, filled=True, fontsize=10, node_ids=True, class_names=None)
plt.show()
Observations
Therefore, we will resort to visualising the important features.
importance=tree.feature_importances_
columns=x.columns
importance_tree=pd.DataFrame(importance, index=columns, columns=["Importance"]).sort_values(by="Importance",ascending=False)
plt.figure(figsize=(8,5))
palette=sns.color_palette("viridis", len(importance_tree))
sns.barplot(x=importance_tree.Importance, y=importance_tree.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance Decision Tree")
Text(0.5, 1.0, 'Feature Importance Decision Tree')
Observation: we note at this stage that the 5 top features identified by the Decision Tree model are:
In order to clarify the importance of the features and, potentially, proceed to more feature engineering, we will tune the model. At the moment, the 3 least relevant features appear to be REASON, JOB and NINQ.
dtree_estimator= DecisionTreeClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
parameters={"criterion": ["gini", "entropy"], "max_depth":[5,10,20,30,40], "min_samples_leaf":[1,2,4,6,8,10]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
gridCV=GridSearchCV(dtree_estimator, parameters, scoring=scorer, cv=10)
gridCV = gridCV.fit(x_train, y_train)
dtree_estimator=gridCV.best_estimator_
dtree_estimator.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.8, 1: 0.2}, max_depth=30, random_state=1)
y_train_pred_tree=dtree_estimator.predict(x_train)
metrics_score(y_train, y_train_pred_tree)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_pred_tree = dtree_estimator.predict(x_test)
metrics_score(y_test, y_test_pred_tree)
precision recall f1-score support
0 0.91 0.93 0.92 1431
1 0.70 0.63 0.66 357
accuracy 0.87 1788
macro avg 0.81 0.78 0.79 1788
weighted avg 0.87 0.87 0.87 1788
dtree_tuned_test=model_performance_classification(dtree_estimator, x_test, y_test)
dtree_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.805 | 0.783 | 0.872 |
Observations: rather than increasing, performance has stalled. It seems that the hyperparameters selected by GridSearchCV over this grid are not the best fit.
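Before re-running a search, it is worth inspecting what the first search actually selected via `best_params_` and `best_score_`. A self-contained toy sketch (synthetic `make_classification` data stands in for the HMEQ split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=1)
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    {"max_depth": [2, 4]}, scoring="recall", cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_)           # the selected max_depth
print(round(grid.best_score_, 3))  # mean cross-validated recall of that setting
```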
dtree_tuned2= DecisionTreeClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
parameters={"criterion": ["gini", "entropy"], "max_depth":[5,10,15], "min_samples_leaf":[2,4,6,8,10]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
gridCV=GridSearchCV(dtree_tuned2, parameters, scoring=scorer, cv=10)
gridCV = gridCV.fit(x_train, y_train)
dtree_tuned2=gridCV.best_estimator_
dtree_tuned2.fit(x_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.8, 1: 0.2}, criterion='entropy', max_depth=15, min_samples_leaf=2, random_state=1)
y_train_tuned2_tree=dtree_tuned2.predict(x_train)
metrics_score(y_train, y_train_tuned2_tree)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_tuned2_tree=dtree_tuned2.predict(x_test)
metrics_score(y_test, y_test_tuned2_tree)
precision recall f1-score support
0 0.91 0.93 0.92 1431
1 0.70 0.63 0.66 357
accuracy 0.87 1788
macro avg 0.81 0.78 0.79 1788
weighted avg 0.87 0.87 0.87 1788
dtree_tuned2_test=model_performance_classification(dtree_tuned2, x_test, y_test)
dtree_tuned2_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.832 | 0.768 | 0.880 |
Observations: the new tuning reduced the depth of the tree (to 15) while setting a minimum leaf size (2). The result decreased the recall further (.768), while improving the precision and the accuracy (.832 and .880). Recall being our most important metric, we will disregard this configuration for the visualisation steps below.
As the first tuned decision tree performed better, we will use it as a basis for the visualisations.
#Decision Tree Tuned Visualisation
features=list(x.columns)
plt.figure(figsize=(25,10))
plot_tree(dtree_estimator, max_depth=5, feature_names=features, filled=True, fontsize=10, node_ids=True, class_names=None)
plt.show()
Observation: even truncated at a depth of 5, the tree is too large to be read directly.
Therefore, we will resort to visualising the important features.
#Feature Importance - Decision Tree Tuned
importance=dtree_estimator.feature_importances_
columns=x.columns
importance_tree_tuned=pd.DataFrame(importance, index=columns, columns=["Importance"]).sort_values(by="Importance",ascending=False)
plt.figure(figsize=(8,5))
palette=sns.color_palette("viridis", len(importance_tree_tuned))
sns.barplot(x=importance_tree_tuned.Importance, y=importance_tree_tuned.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance Decision Tree Tuned")
Text(0.5, 1.0, 'Feature Importance Decision Tree Tuned')
Observations: the key features remain largely the same, but their order and the magnitude of their importance change:
The least important features remain REASON, JOB and NINQ. However, the importance of REASON seems to have decreased further with the tuning. Before proceeding with feature engineering and dropping some of the less important features, we will build the Random Forest Classifier model, so as to rely on the findings of a more robust model before discarding any information.
rf_1=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
rf_1.fit(x_train, y_train)
RandomForestClassifier(class_weight={0: 0.8, 1: 0.2}, random_state=1)
y_train_rf1=rf_1.predict(x_train)
metrics_score(y_train, y_train_rf1)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf1=rf_1.predict(x_test)
metrics_score(y_test, y_test_rf1)
precision recall f1-score support
0 0.92 0.98 0.95 1431
1 0.87 0.66 0.75 357
accuracy 0.91 1788
macro avg 0.90 0.82 0.85 1788
weighted avg 0.91 0.91 0.91 1788
rf_1_test=model_performance_classification(rf_1, x_test, y_test)
rf_1_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.895 | 0.817 | 0.912 |
Observations: the training metrics are all 1.00, meaning the model fits the training data perfectly; the gap to the test scores indicates some overfitting, but the Random Forest Classifier still generalises better than the Decision Tree:
Accuracy is starting to be acceptable; however, our focus is on the recall rate, which should be maximised. To do so, we will first tune the hyperparameters before attempting feature engineering measures.
#Feature Importance Visualisation
importances_rf1=rf_1.feature_importances_
columns=x.columns
importance_rf_1=pd.DataFrame(importances_rf1, index=columns, columns=["Importance"]).sort_values(by="Importance", ascending=False)
palette=sns.color_palette("viridis", len(importance_rf_1))
sns.barplot(x=importance_rf_1.Importance, y=importance_rf_1.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance - Random Forest")
Text(0.5, 1.0, 'Feature Importance - Random Forest')
Observation: while the key features remain the same, their order changes relative to the tuned Decision Tree model:
Least important features: REASON, JOB and NINQ remain the least important (although we note an inversion in the order of JOB and NINQ).
Tuning the RF model should help us better differentiate the features by importance, and better identify the most and least important ones.
rf_1_tuned=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
params_rf={"n_estimators":[100,200,300], "max_depth":[None, 10, 20], "max_features":["auto", "sqrt"]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(rf_1_tuned, params_rf, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x_train, y_train)
rf_1_tuned=grid_obj.best_estimator_
rf_1_tuned.fit(x_train, y_train)
RandomForestClassifier(random_state=1)
y_train_rf_tuned1=rf_1_tuned.predict(x_train)
metrics_score(y_train, y_train_rf_tuned1)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf_tuned1=rf_1_tuned.predict(x_test)
metrics_score(y_test, y_test_rf_tuned1)
precision recall f1-score support
0 0.92 0.98 0.95 1431
1 0.89 0.67 0.76 357
accuracy 0.92 1788
macro avg 0.90 0.82 0.86 1788
weighted avg 0.91 0.92 0.91 1788
rf_1_tuned_test=model_performance_classification(rf_1_tuned,x_test,y_test)
rf_1_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.904 | 0.824 | 0.917 |
Observations: the training metrics are again all 1.00 (the model memorises the training set), but the test performance slightly improves on all metrics:
Accuracy is acceptable; however, our focus remains on the recall rate, which should be maximised. We will now attempt feature engineering measures.
#Feature Importance Visualisation
importances_rf1_tuned=rf_1_tuned.feature_importances_
columns=x.columns
importances_rf1_tuned=pd.DataFrame(importances_rf1_tuned, index=columns, columns=["Importance"]).sort_values(by="Importance", ascending=False)
palette=sns.color_palette("viridis", len(importances_rf1_tuned))
sns.barplot(x=importances_rf1_tuned.Importance, y=importances_rf1_tuned.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance - Random Forest Tuned")
Text(0.5, 1.0, 'Feature Importance - Random Forest Tuned')
Observations: the key features remain the same, but the order changes again:
Least important features: still REASON, JOB, NINQ. We will proceed with incremental feature engineering, iteratively dropping the lowest-importance features.
We iteratively dropped REASON, JOB and NINQ, but as performance degraded when removing JOB and NINQ, we reverted to the first option, shown below, which only drops REASON. The results of all iterations are summarised in the Observations at the end of this section.
dt_drop=dt.copy()
dt_drop=dt_drop.drop(["REASON"],axis=1)
dt_drop.head()
| BAD | LOAN | MORTDUE | VALUE | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000 | 39025.000 | 0 | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 70053.000 | 68400.000 | 0 | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 13500.000 | 16700.000 | 0 | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 65019.000 | 89235.500 | 0 | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 97800.000 | 112000.000 | 1 | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
x2=dt_drop.drop(["BAD"], axis=1)
y2=dt_drop["BAD"]
x2_train, x2_test, y2_train, y2_test=train_test_split(x2,y2,test_size=0.3, random_state=1, stratify=y2)
print("x2 shape:", x2.shape)
print("y2 shape:", y2.shape)
x2 shape: (5960, 11) y2 shape: (5960,)
rf_1_tuned_drop=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
params_rf={"n_estimators":[100,200,300], "max_depth":[None, 10, 20], "max_features":["auto", "sqrt"]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(rf_1_tuned_drop, params_rf, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x2_train, y2_train)
rf_1_tuned_drop=grid_obj.best_estimator_
rf_1_tuned_drop.fit(x2_train, y2_train)
RandomForestClassifier(class_weight={0: 0.8, 1: 0.2}, n_estimators=300, random_state=1)
y_train_rf_tuned1drop=rf_1_tuned_drop.predict(x2_train)
metrics_score(y2_train, y_train_rf_tuned1drop)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf_tuned1drop=rf_1_tuned_drop.predict(x2_test)
metrics_score(y2_test, y_test_rf_tuned1drop)
precision recall f1-score support
0 0.93 0.97 0.95 1431
1 0.86 0.69 0.76 357
accuracy 0.91 1788
macro avg 0.89 0.83 0.85 1788
weighted avg 0.91 0.91 0.91 1788
rf_1_tuned_drop_test=model_performance_classification(rf_1_tuned_drop,x2_test,y2_test)
rf_1_tuned_drop_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.891 | 0.829 | 0.914 |
Observations: performance improved when dropping REASON, but degraded when JOB and NINQ were iteratively dropped as well (the three iterations are not shown in the code for readability).
We will therefore only drop REASON from our main dataset and continue the tuning on that basis. We will later evaluate whether dropping further features improves performance.
rf_2_tuned=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
params_rf={"n_estimators":[100,200,300], "max_depth":[None, 10, 20], "max_features":["auto", "sqrt"], "min_samples_split": [2, 5, 10], "min_samples_leaf":[1,2,4]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(rf_2_tuned, params_rf, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x2_train, y2_train)
rf_2_tuned=grid_obj.best_estimator_
rf_2_tuned.fit(x2_train, y2_train)
RandomForestClassifier(class_weight={0: 0.8, 1: 0.2}, n_estimators=300, random_state=1)
y_train_rf_tuned2=rf_2_tuned.predict(x2_train)
metrics_score(y2_train, y_train_rf_tuned2)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf_tuned2=rf_2_tuned.predict(x2_test)
metrics_score(y2_test, y_test_rf_tuned2)
precision recall f1-score support
0 0.93 0.97 0.95 1431
1 0.86 0.69 0.76 357
accuracy 0.91 1788
macro avg 0.89 0.83 0.85 1788
weighted avg 0.91 0.91 0.91 1788
rf_2_tuned_test=model_performance_classification(rf_2_tuned,x2_test,y2_test)
rf_2_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.891 | 0.829 | 0.914 |
dt_outlier=dt_drop.copy()
def outliers(col):
Q1 = dt_outlier[col].quantile(0.25)
Q3 = dt_outlier[col].quantile(0.75)
IQR = Q3 - Q1
min_value=Q1 - 1.5*IQR
max_value=Q3 + 1.5*IQR
dt_outlier[col]=dt_outlier[col].clip(lower=min_value, upper=max_value)
return dt_outlier
outliers("MORTDUE")
| BAD | LOAN | MORTDUE | VALUE | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.000 | 39025.000 | 0 | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 70053.000 | 68400.000 | 0 | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 13500.000 | 16700.000 | 0 | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 65019.000 | 89235.500 | 0 | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 97800.000 | 112000.000 | 1 | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.000 | 90185.000 | 0 | 16.000 | 0.000 | 0.000 | 221.809 | 0.000 | 16.000 | 36.112 |
| 5956 | 0 | 89000 | 54576.000 | 92937.000 | 0 | 16.000 | 0.000 | 0.000 | 208.692 | 0.000 | 15.000 | 35.860 |
| 5957 | 0 | 89200 | 54045.000 | 92924.000 | 0 | 15.000 | 0.000 | 0.000 | 212.280 | 0.000 | 15.000 | 35.557 |
| 5958 | 0 | 89800 | 50370.000 | 91861.000 | 0 | 14.000 | 0.000 | 0.000 | 213.893 | 0.000 | 16.000 | 34.341 |
| 5959 | 0 | 89900 | 48811.000 | 88934.000 | 0 | 15.000 | 0.000 | 0.000 | 219.601 | 0.000 | 16.000 | 34.572 |
5960 rows × 12 columns
x3=dt_outlier.drop(["BAD"], axis=1)
y3=dt_outlier["BAD"]
x3_train, x3_test, y3_train, y3_test=train_test_split(x3,y3,test_size=0.3, random_state=1, stratify=y3)
print("x3 shape:", x3.shape)
print("y3 shape:", y3.shape)
x3 shape: (5960, 11) y3 shape: (5960,)
rf_3_tuned=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
params_rf={"n_estimators":[200,300], "max_depth":[None, 10, 20], "max_features":["auto", "sqrt"]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(rf_3_tuned, params_rf, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x3_train, y3_train)
rf_3_tuned=grid_obj.best_estimator_
rf_3_tuned.fit(x3_train, y3_train)
RandomForestClassifier(class_weight={0: 0.8, 1: 0.2}, n_estimators=300, random_state=1)
y_train_rf_3_tuned=rf_3_tuned.predict(x3_train)
metrics_score(y3_train, y_train_rf_3_tuned)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf_3_tuned=rf_3_tuned.predict(x3_test)
metrics_score(y3_test, y_test_rf_3_tuned)
precision recall f1-score support
0 0.93 0.97 0.95 1431
1 0.87 0.69 0.77 357
accuracy 0.92 1788
macro avg 0.90 0.83 0.86 1788
weighted avg 0.91 0.92 0.91 1788
rf_3_tuned_test=model_performance_classification(rf_3_tuned,x3_test,y3_test)
rf_3_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.896 | 0.831 | 0.917 |
Observations: several combinations were attempted, including outlier treatment of MORTDUE, VALUE and CLAGE. However, the best results were achieved when treating outliers for MORTDUE only:
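The combination search described above can be sketched as a loop over column subsets: apply IQR clipping to each subset, retrain, and compare test recall. Column names and data below are illustrative stand-ins for MORTDUE / VALUE / CLAGE, not the notebook's actual cells:

```python
# Sketch: compare IQR-clipping combinations by test recall (illustrative data).
from itertools import combinations
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X = pd.DataFrame(X, columns=[f"c{i}" for i in range(6)])

def clip_iqr(df, cols):
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out

results = {}
for r in range(1, 3):
    for cols in combinations(["c0", "c1", "c2"], r):   # stand-ins for MORTDUE/VALUE/CLAGE
        Xc = clip_iqr(X, cols)
        x_tr, x_te, y_tr, y_te = train_test_split(Xc, y, test_size=0.3, random_state=1, stratify=y)
        rf = RandomForestClassifier(random_state=1).fit(x_tr, y_tr)
        results[cols] = recall_score(y_te, rf.predict(x_te))
best = max(results, key=results.get)
print(best, round(results[best], 3))
```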
#Feature Importance Visualisation
importances_rf_3_tuned=rf_3_tuned.feature_importances_
columns=x3.columns
importances_rf_3_tuned=pd.DataFrame(importances_rf_3_tuned, index=columns, columns=["Importance"]).sort_values(by="Importance", ascending=False)
palette=sns.color_palette("viridis", len(importances_rf_3_tuned))
sns.barplot(x=importances_rf_3_tuned.Importance, y=importances_rf_3_tuned.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance - Random Forest Tuned #3")
Text(0.5, 1.0, 'Feature Importance - Random Forest Tuned #3')
Observations: DEBTINC remains the key feature in the model; however, we note some changes among the following ones: DELINQ falls from 2nd to 5th place, VALUE comes 2nd (from 3rd) but remains very close to CLAGE, and LOAN comes 4th, very close to DELINQ.
Least important features remain JOB and NINQ.
In the EDA, it appeared that MORTDUE and VALUE were strongly correlated (0.9 on the heatmap). In the latest feature importance mapping, it appeared that VALUE was more important than MORTDUE. We will try dropping MORTDUE to improve performance.
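A quick way to confirm such collinearity before dropping a feature is `dt_drop[["MORTDUE", "VALUE"]].corr()`. The sketch below reproduces the idea on synthetic stand-in data, since the notebook's frame is not loaded here:

```python
# Check pairwise correlation between two features before dropping one of them.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
value = rng.normal(100_000, 30_000, 1000)
mortdue = 0.7 * value + rng.normal(0, 10_000, 1000)  # strongly tied to VALUE by construction
df = pd.DataFrame({"MORTDUE": mortdue, "VALUE": value})

corr = df["MORTDUE"].corr(df["VALUE"])
print(round(corr, 2))  # a value near 0.9 means one of the pair is near-redundant
```

When two features are this correlated, tree ensembles split their importance between them, so dropping the weaker one is a reasonable experiment — though, as seen below, it does not always help.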
dt_drop2=dt_drop.copy()
dt_drop2=dt_drop2.drop("MORTDUE", axis=1)
dt_drop2.head()
| BAD | LOAN | VALUE | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 39025.000 | 0 | 10.500 | 0.000 | 0.000 | 94.367 | 1.000 | 9.000 | 34.818 |
| 1 | 1 | 1300 | 68400.000 | 0 | 7.000 | 0.000 | 2.000 | 121.833 | 0.000 | 14.000 | 34.818 |
| 2 | 1 | 1500 | 16700.000 | 0 | 4.000 | 0.000 | 0.000 | 149.467 | 1.000 | 10.000 | 34.818 |
| 3 | 1 | 1500 | 89235.500 | 0 | 7.000 | 0.000 | 0.000 | 173.467 | 1.000 | 20.000 | 34.818 |
| 4 | 0 | 1700 | 112000.000 | 1 | 3.000 | 0.000 | 0.000 | 93.333 | 0.000 | 14.000 | 34.818 |
x4=dt_drop2.drop(["BAD"], axis=1)
y4=dt_drop2["BAD"]
x4_train, x4_test, y4_train, y4_test=train_test_split(x4,y4,test_size=0.3, random_state=1, stratify=y4)
print("x4 shape:", x4.shape)
print("y4 shape:", y4.shape)
x4 shape: (5960, 10) y4 shape: (5960,)
rf_4_tuned=RandomForestClassifier(class_weight={0:0.80, 1:0.2}, random_state=1)
params_rf={"n_estimators":[200,300], "max_depth":[None, 10, 20], "max_features":["auto", "sqrt"]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(rf_4_tuned, params_rf, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x4_train, y4_train)
rf_4_tuned=grid_obj.best_estimator_
rf_4_tuned.fit(x4_train, y4_train)
RandomForestClassifier(class_weight={0: 0.8, 1: 0.2}, n_estimators=300, random_state=1)
y_train_rf_4_tuned=rf_4_tuned.predict(x4_train)
metrics_score(y4_train, y_train_rf_4_tuned)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_rf_4_tuned=rf_4_tuned.predict(x4_test)
metrics_score(y4_test, y_test_rf_4_tuned)
precision recall f1-score support
0 0.92 0.97 0.95 1431
1 0.85 0.68 0.75 357
accuracy 0.91 1788
macro avg 0.88 0.82 0.85 1788
weighted avg 0.91 0.91 0.91 1788
rf_4_tuned_test=model_performance_classification(rf_4_tuned,x4_test,y4_test)
rf_4_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.885 | 0.824 | 0.911 |
Observations: performance decreases. We will revert to the previous dataset (dt_outlier, with independent and dependent variables x3 and y3) to move forward.
gradboost_model=GradientBoostingClassifier(random_state=1)
gradboost_model.fit(x3_train, y3_train)
gradboost_model_perf_test=model_performance_classification(gradboost_model, x3_test, y3_test)
gradboost_model_perf_test
Observation: the best performance to date on recall, as well as on precision and accuracy.
XGBoost_model=XGBClassifier(random_state=1)
XGBoost_model.fit(x3_train, y3_train)
XGBoost_model_perf_test=model_performance_classification(XGBoost_model, x3_test, y3_test)
XGBoost_model_perf_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.895 | 0.827 | 0.915 |
Observation: performance decreases below the level reached with Random Forest and Gradient Boosting. We will proceed with the models already developed.
lgbm=lgb.LGBMClassifier(class_weight={0: 0.80, 1: 0.2},random_state=1)
params_lgbm= {"n_estimators": [200, 300],
"max_depth": [10, 20, -1],
"learning_rate": [0.01, 0.1, 0.2],
"num_leaves": [31, 50, 100]
}
scorer=metrics.make_scorer(metrics.recall_score, pos_label=1)
grid_obj=GridSearchCV(lgbm, params_lgbm, scoring=scorer, cv=5)
grid_obj=grid_obj.fit(x3_train, y3_train)
lgbm_tuned=grid_obj.best_estimator_
y_train_lgbm_tuned = lgbm_tuned.predict(x3_train)
train_metrics = metrics.classification_report(y3_train, y_train_lgbm_tuned)
print("Training metrics:\n", train_metrics)
Training metrics:
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_lgbm_tuned = lgbm_tuned.predict(x3_test)
test_metrics = metrics.classification_report(y3_test, y_test_lgbm_tuned)
print("Test metrics:\n", test_metrics)
Test metrics:
precision recall f1-score support
0 0.93 0.98 0.95 1431
1 0.89 0.71 0.79 357
accuracy 0.92 1788
macro avg 0.91 0.84 0.87 1788
weighted avg 0.92 0.92 0.92 1788
lgbm_tuned_test=model_performance_classification(lgbm_tuned,x3_test,y3_test)
lgbm_tuned_test=pd.DataFrame(lgbm_tuned_test)
lgbm_tuned_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.912 | 0.842 | 0.924 |
Observation: this model achieves the best accuracy so far (0.924, up from 0.922 with the untuned Gradient Boosting classifier) and the best summary recall (0.842); however, the class-1 recall in the classification report (0.71) shows that defaulters remain the harder class to catch.
Note for the corrector: I ran the following model and found the results listed below in the Observations; however, due to connection issues, I could not get it to run again, which is why I deleted the failed loading messages and did not re-run the cells.
import catboost as cb
catboost_model=cb.CatBoostClassifier(random_state=1, silent=True)
params_catboost={"iterations": [200, 300, 500],
"depth": [4, 6, 10],
"learning_rate": [0.01, 0.1, 0.2],
"l2_leaf_reg": [1, 3, 5, 7],
"border_count": [32, 50, 100],
"bagging_temperature": [0.1, 0.5, 1.0],
"rsm": [0.6, 0.8, 1.0]}
grid_catboost=GridSearchCV(catboost_model, params_catboost, scoring=scorer, cv=5)
grid_catboost.fit(x3_train, y3_train)
best_catboost=grid_catboost.best_estimator_
y_test_pred_catboost=best_catboost.predict(x3_test)
catboost_model_test=model_performance_classification(best_catboost, x3_test, y3_test)
catboost_model_test
Observations: performance fell to 0.67 on the recall metric which, regardless of the other metrics, is enough of a drop not to pursue this model further.
As the Gradient Boosting model performed best at this initial stage, we will work further on its tuning.
#Reminder of GradientBoost Model Performance without Tuning
gradboost_model=GradientBoostingClassifier(random_state=1)
gradboost_model.fit(x3_train, y3_train)
gradboost_model_perf_test=model_performance_classification(gradboost_model, x3_test, y3_test)
gradboost_model_perf_test
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.905 | 0.840 | 0.922 |
gradboost_tuned1=GradientBoostingClassifier(random_state=1)
params_gb={"n_estimators":[150], "max_depth":[10], "max_features":["sqrt"], "learning_rate":[0.4], "min_samples_split": [10], "min_samples_leaf": [4]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(gradboost_tuned1, params_gb, scoring=scorer, cv=3)
grid_obj=grid_obj.fit(x3_train, y3_train)
gradboost_tuned1=grid_obj.best_estimator_
gradboost_tuned1.fit(x3_train, y3_train)
GradientBoostingClassifier(learning_rate=0.4, max_depth=10, max_features='sqrt', min_samples_leaf=4, min_samples_split=10, n_estimators=150, random_state=1)
y_train_gradboost_tuned1=gradboost_tuned1.predict(x3_train)
metrics_score(y3_train, y_train_gradboost_tuned1)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_gradboost_tuned1=gradboost_tuned1.predict(x3_test)
metrics_score(y3_test, y_test_gradboost_tuned1)
precision recall f1-score support
0 0.93 0.98 0.96 1431
1 0.92 0.70 0.80 357
accuracy 0.93 1788
macro avg 0.92 0.84 0.88 1788
weighted avg 0.93 0.93 0.92 1788
gradboost_tuned_test1=model_performance_classification(gradboost_tuned1,x3_test,y3_test)
gradboost_tuned_test1
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.923 | 0.844 | 0.928 |
Observation: the highest performance to date, with recall at 0.844 and precision and accuracy above 0.92. We aim to improve the recall score a bit more; however, these results were reached after 8 iterations (not included in the code for clarity's sake), each of which took at least 10 minutes to run. To further tune the model, I will therefore reduce the dataset size and the number of cross-validation folds for the tuning.
#Reducing Dataset size #2
x3_train_sample = x3_train.sample(frac=0.5, random_state=1)
y3_train_sample = y3_train.loc[x3_train_sample.index] #align labels with the sampled rows
#Tuning #2
gradboost_tuned2=GradientBoostingClassifier(random_state=1)
params_gb={"n_estimators":[150], "max_depth":[10,11], "max_features":["sqrt"], "learning_rate":[0.38,0.39, 0.4], "min_samples_split": [10], "min_samples_leaf": [4], "subsample": [0.79, 0.8]}
scorer=metrics.make_scorer(recall_score, pos_label=1)
grid_obj=GridSearchCV(gradboost_tuned2, params_gb, scoring=scorer, cv=3)
grid_obj=grid_obj.fit(x3_train_sample, y3_train_sample)
gradboost_tuned2=grid_obj.best_estimator_
gradboost_tuned2.fit(x3_train, y3_train)
GradientBoostingClassifier(learning_rate=0.38, max_depth=11, max_features='sqrt', min_samples_leaf=4, min_samples_split=10, n_estimators=150, random_state=1, subsample=0.8)
y_train_gradboost_tuned2=gradboost_tuned2.predict(x3_train)
metrics_score(y3_train, y_train_gradboost_tuned2)
precision recall f1-score support
0 1.00 1.00 1.00 3340
1 1.00 1.00 1.00 832
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
y_test_gradboost_tuned2=gradboost_tuned2.predict(x3_test)
metrics_score(y3_test, y_test_gradboost_tuned2)
precision recall f1-score support
0 0.93 0.98 0.96 1431
1 0.91 0.72 0.80 357
accuracy 0.93 1788
macro avg 0.92 0.85 0.88 1788
weighted avg 0.93 0.93 0.93 1788
gradboost_tuned_test2=model_performance_classification(gradboost_tuned2,x3_test,y3_test)
gradboost_tuned_test2
| Precision | Recall | Accuracy | |
|---|---|---|---|
| 0 | 0.924 | 0.850 | 0.930 |
gradboost_tuned_test2_df=pd.DataFrame(gradboost_tuned_test2, index=[0])
Observations: these hyperparameters reached the best recall score achieved so far:
Another set of hyperparameters (max_depth 10 instead of 11, learning_rate 0.4 instead of 0.38, other hyperparameters unchanged) reached better precision and accuracy, but the focus being on recall, I sided with the ones presented above.
Gradient Booster - Tuning #3 - Visualisation
#Feature Importance Visualisation
importances_gb_tuned=gradboost_tuned2.feature_importances_
columns=x3.columns
importances_gb_tuned=pd.DataFrame(importances_gb_tuned, index=columns, columns=["Importance"]).sort_values(by="Importance", ascending=False)
palette=sns.color_palette("viridis", len(importances_gb_tuned))
sns.barplot(x=importances_gb_tuned.Importance, y=importances_gb_tuned.index, palette=palette)
plt.xlabel("Importance", fontsize=12)
plt.ylabel("Features", fontsize=12)
plt.title("Feature Importance - Gradient Booster Tuned #3")
Text(0.5, 1.0, 'Feature Importance - Gradient Booster Tuned #3')
Observations: the key features remain the same, while the least relevant ones, NINQ and JOB, have nearly equal importance scores.
We tried dropping these features, but performance decreased (iterations not included in the code for readability).
# We took out Logistic Regression from the table due to the different metrics being used
#We took out some of the models that we decided not to go with (ie: RF Tuning #4 which decreased performance,
#and led us to go back to RF Tuning #3)
models_test_comp_df = pd.concat(
[ dtree_test.T, dtree_tuned_test.T, dtree_tuned2_test.T, rf_1_test.T, rf_1_tuned_test.T,
rf_1_tuned_drop_test.T, rf_2_tuned_test.T, rf_3_tuned_test.T, gradboost_model_perf_test.T,
XGBoost_model_perf_test.T, lgbm_tuned_test.T, gradboost_tuned_test1.T, gradboost_tuned_test2.T,], axis = 1,)
models_test_comp_df.columns = [
"Decision Tree Classifier",
"Decision Tree Tuning #1",
"Decision Tree Tuning #2",
"Random Forest Classifier",
"Random Forest Tuning #1",
"Random Forest Tuning #1 Drop",
"Random Forest Tuning #2",
"Random Forest Tuning #3",
"Gradient Boost",
"XGBoost",
"Light GBM",
"Gradient Boost Tuning #1",
"Gradient Boost Tuning #2"]
models_test_comp_df
| Decision Tree Classifier | Decision Tree Tuning #1 | Decision Tree Tuning #2 | Random Forest Classifier | Random Forest Tuning #1 | Random Forest Tuning #1 Drop | Random Forest Tuning #2 | Random Forest Tuning #3 | Gradient Boost | XGBoost | Light GBM | Gradient Boost Tuning #1 | Gradient Boost Tuning #2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.806 | 0.805 | 0.832 | 0.895 | 0.904 | 0.891 | 0.891 | 0.896 | 0.905 | 0.895 | 0.912 | 0.923 | 0.924 |
| Recall | 0.788 | 0.783 | 0.768 | 0.817 | 0.824 | 0.829 | 0.829 | 0.831 | 0.840 | 0.827 | 0.842 | 0.844 | 0.850 |
| Accuracy | 0.874 | 0.872 | 0.880 | 0.912 | 0.917 | 0.914 | 0.914 | 0.917 | 0.922 | 0.915 | 0.924 | 0.928 | 0.930 |
The model exploration was primarily led by the importance of the recall metric, set as the measure of success for this exercise. Recall proved more difficult to improve than precision and accuracy, which already performed quite well (about 0.8) in some of the untuned models and passed the 0.9 threshold in the early tuning stages. The final results are satisfactory for these two metrics, but there is still room for improvement on recall.
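One inexpensive lever not explored above is the decision threshold: lowering the default 0.5 cutoff applied to `predict_proba` trades precision for recall without retraining. A sketch on synthetic data (the model and split stand in for the notebook's), illustrating the idea only:

```python
# Sketch: raising recall on the positive (default) class by lowering the
# probability threshold, instead of retraining the model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

gb = GradientBoostingClassifier(random_state=1).fit(x_tr, y_tr)
proba = gb.predict_proba(x_te)[:, 1]

recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))  # flags more cases as risky
print(recall_default, recall_lowered)
```

The cost is more false alarms (lower precision); the right cutoff depends on how the bank weighs a missed defaulter against a wrongly declined client.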
Best model: this exploration allowed us to settle on the tuned Gradient Boosting model as the best performer.
Improvements:
Key importance of the debt-to-income ratio (DEBTINC). This feature appeared consistently, across all models and tuning stages, as the key feature. It appears that above a certain ratio, i.e. when clients' debt payments represent a higher proportion of their income, the risk of defaulting increases abruptly. Before proceeding further with model development and roll-out, some initial business recommendations include:
Further risk mitigation for risky loans: the bank might want to consider, as part of its monitoring of higher-risk clients, implementing dedicated support activities involving financial literacy and education. Indeed, the higher-risk profiles appear to be among the higher debt-to-income ratios, which would be in the highest need of such support. Numerous studies have suggested that education is more effective than punishment at inducing behaviour change, so this could be an avenue for the bank to explore to limit the risks involved with loans already granted (not future loans, which might be denied to this clientele).
What model do you propose to be adopted? Why is this the best solution to adopt?
We propose to adopt the tuned Gradient Boosting model, as it reached the best performance and, with the use of a reduced dataset for tuning, was not computationally intensive compared to heavily tuned Random Forest models or others such as CatBoost and LightGBM. The balance between performance and computational cost drives us toward this model for implementation, although further steps could improve performance.
Data collection and monitoring: as monitoring develops, it would be useful to keep tuning the model, as well as 2-3 alternative models in parallel, since, as the data evolves, other models might start to perform better.
dt_default=dt_outlier.copy()
dt_default=dt_default[dt_default["BAD"]==1]
print( "Total default :", dt_default["LOAN"].sum())
print( "Total loans :", dt["LOAN"].sum())
Total default : 20120400 Total loans : 110903500
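The two totals printed above reproduce the 18.14% share of loaned amounts lost to default quoted in the introduction:

```python
# Share of the total loaned amount tied up in defaults, from the sums above.
total_default = 20_120_400
total_loans = 110_903_500
share = total_default / total_loans * 100
print(f"{share:.2f}%")  # ≈ 18.14%
```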
#Copy the dataset to create bins and facilitate the visualisation
dt_bins2=dt_default.copy()
def loan_sums_debtinc(dt_bins2):
#Binning DEBTINC into 1-point-wide bins
bins = list(range(0, 41)) + [np.inf]
names = ['<1'] + [f'{i}-{i+1}' for i in range(1, 40)] + ['40+']
dt_bins2['DEBTINC'] = pd.cut(dt_bins2['DEBTINC'], bins, labels=names, right=False)
dt_bins2['DEBTINC'] = dt_bins2['DEBTINC'].cat.reorder_categories(names, ordered=True)
loan_sums = dt_bins2.groupby("DEBTINC")['LOAN'].sum().reset_index()
loan_sums.columns = ["DEBTINC", 'Loan_Sum']
total_loan_sum = loan_sums['Loan_Sum'].sum()
loan_sums['Percentage_of_Total'] = (loan_sums['Loan_Sum'] / total_loan_sum) * 100
return loan_sums
loan_sums_table = loan_sums_debtinc(dt_bins2)
print(loan_sums_table)
   DEBTINC  Loan_Sum  Percentage_of_Total
0       <1     31100                0.155
1      1-2     98700                0.491
2      2-3     57100                0.284
3      3-4     67200                0.334
4      4-5         0                0.000
5      5-6     28300                0.141
6      6-7     27700                0.138
7      7-8         0                0.000
8      8-9         0                0.000
9     9-10         0                0.000
10   10-11         0                0.000
11   11-12         0                0.000
12   12-13     42000                0.209
13   13-14     42800                0.213
14   14-15         0                0.000
15   15-16         0                0.000
16   16-17     55900                0.278
17   17-18     20500                0.102
18   18-19     35200                0.175
19   19-20     13400                0.067
20   20-21     24000                0.119
21   21-22     61500                0.306
22   22-23     80700                0.401
23   23-24     58200                0.289
24   24-25     60300                0.300
25   25-26    107000                0.532
26   26-27     33100                0.165
27   27-28    117400                0.583
28   28-29     72900                0.362
29   29-30    104800                0.521
30   30-31    104700                0.520
31   31-32    307200                1.527
32   32-33    186300                0.926
33   33-34    209500                1.041
34   34-35  12918000               64.203
35   35-36    477700                2.374
36   36-37    388900                1.933
37   37-38    197200                0.980
38   38-39    304100                1.511
39   39-40    166900                0.830
40     40+   3620100               17.992
# Binning DEBTINC 1 (1-point-wide bins) on the full dataset, for the relative-risk table
bins = list(range(41)) + [np.inf]
names = ['<1'] + [f'{i}-{i + 1}' for i in range(1, 40)] + ['40+']
dt_bins['DEBTINC'] = pd.cut(dt_bins['DEBTINC'], bins, labels=names, right=False)
dt_bins['DEBTINC'] = dt_bins['DEBTINC'].cat.reorder_categories(names, ordered=True)
#Table function
from tabulate import tabulate

def key_feat(col, head):
    # Share of defaulters (BAD=1) within each category of `col`, highest first
    breakdown = (dt_bins.groupby(col)["BAD"].value_counts(normalize=True) * 100).unstack().fillna(0)
    breakdown = breakdown.sort_index()
    key_feat_subs = breakdown[1].sort_values(ascending=False).head(head)
    key_feat_df = key_feat_subs.reset_index()
    key_feat_df.columns = [col, '% of defaulters (BAD=1)']
    table = tabulate(key_feat_df, headers='keys', tablefmt='fancy_grid', showindex=False)
    print(table)
#Granular breakdown of the relative risk (1 by 1)
key_feat('DEBTINC',40)
╒═══════════╤═══════════════════════════╕
│ DEBTINC   │ % of defaulters (BAD=1)   │
╞═══════════╪═══════════════════════════╡
│ 6-7       │ 100                       │
│ <1        │ 66.6667                   │
│ 2-3       │ 50                        │
│ 1-2       │ 50                        │
│ 5-6       │ 50                        │
│ 3-4       │ 37.5                      │
│ 40+       │ 19.0157                   │
│ 13-14     │ 12.5                      │
│ 16-17     │ 12.5                      │
│ 35-36     │ 9.83607                   │
│ 31-32     │ 9.04255                   │
│ 36-37     │ 8.9701                    │
│ 12-13     │ 7.69231                   │
│ 18-19     │ 7.69231                   │
│ 38-39     │ 7.21649                   │
│ 21-22     │ 6.25                      │
│ 32-33     │ 6.25                      │
│ 34-35     │ 5.66038                   │
│ 23-24     │ 5.61798                   │
│ 17-18     │ 5.55556                   │
│ 22-23     │ 5.55556                   │
│ 24-25     │ 5.26316                   │
│ 33-34     │ 5.19481                   │
│ 25-26     │ 5.17241                   │
│ 37-38     │ 4.87805                   │
│ 27-28     │ 4.43038                   │
│ 30-31     │ 4.30108                   │
│ 39-40     │ 4.25532                   │
│ 28-29     │ 4.16667                   │
│ 29-30     │ 3.51759                   │
│ 26-27     │ 3.2                       │
│ 20-21     │ 2.89855                   │
│ 19-20     │ 2.5                       │
│ 8-9       │ 0                         │
│ 4-5       │ 0                         │
│ 15-16     │ 0                         │
│ 14-15     │ 0                         │
│ 11-12     │ 0                         │
│ 10-11     │ 0                         │
│ 9-10      │ 0                         │
╘═══════════╧═══════════════════════════╛
# Second breakdown of the relative risk (6-wide bins, to compute the
# overall risk of the 0-6 sub-group identified above)
# Binning DEBTINC 2 -- re-create dt_bins first (assuming it was built from
# dt_outlier), since DEBTINC was turned categorical by the previous binning
# and pd.cut needs numeric input
dt_bins = dt_outlier.copy()
bins = [0, 6, 12, 18, 24, 30, 36, 42, np.inf]
names = ['<6', '6-12', '12-18', '18-24', '24-30', '30-36', '36-42', '42+']
dt_bins['DEBTINC'] = pd.cut(dt_bins['DEBTINC'], bins, labels=names, right=False)
dt_bins['DEBTINC'] = dt_bins['DEBTINC'].cat.reorder_categories(names, ordered=True)
# Relative risk per 6-wide DEBTINC bin
key_feat('DEBTINC',40)
╒═══════════╤═══════════════════════════╕
│ DEBTINC   │ % of defaulters (BAD=1)   │
╞═══════════╪═══════════════════════════╡
│ <6        │ 40.7407                   │
│ 30-36     │ 34.1416                   │
│ 42+       │ 34.0845                   │
│ 6-12      │ 14.2857                   │
│ 36-42     │ 7.23529                   │
│ 12-18     │ 6.12245                   │
│ 18-24     │ 5                         │
│ 24-30     │ 4.20561                   │
╘═══════════╧═══════════════════════════╛
Observation - ABSOLUTE RISK: addressing the defaults in the DEBTINC = 34-35 bin would allow the bank to avoid 64.203% of its defaulted amount, thereby saving 12,918,000 USD/EUR (currency unknown). Keeping a keen eye on the 40+ DEBTINC ratios would save a further 17.992% of the defaulted amount, i.e. 3,620,100 USD/EUR. All in all, closely monitoring, and reducing exposure to, these two ratio groups could reduce defaults by 82.196%.
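The combined 82.196% figure can be checked directly from the amounts above:

```python
# Defaulted amounts for the two highest-impact DEBTINC bins (from the table above)
bin_34_35 = 12_918_000    # 64.203% of the defaulted amount
bin_40_plus = 3_620_100   # 17.992% of the defaulted amount
total_default = 20_120_400

combined_share = (bin_34_35 + bin_40_plus) / total_default * 100
print(round(combined_share, 3))  # → 82.196
```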
Observation - RELATIVE RISK: relative and absolute risk identify different categories that should be addressed by the bank.
- 0-6: the default rate reaches 40.74% - this group represents only 1.54% of all loans but amounts to 3.1 million USD/EUR
- 34-35: relative risk of only 5.66%, despite carrying the largest absolute exposure
- 40+: relative risk of 19.02% -- a more detailed analysis could be interesting here, as the maximum debt-to-income value is 203.31, so a number of sub-groups might exist within these outliers, each with a different risk level.
dt["DEBTINC"].max()
203.31214869
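A hypothetical sketch of how the 40+ outliers could be sub-grouped, using quantile bins on synthetic stand-in values (the notebook would instead pass `dt.loc[dt["DEBTINC"] > 40, "DEBTINC"]`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for the 40+ DEBTINC outliers, bounded by the observed maximum of 203.31
high_ratios = pd.Series(rng.uniform(40, 203.31, size=300))

# Four equal-frequency sub-groups; the default rate of each could then be compared
sub_groups = pd.qcut(high_ratios, q=4)
print(sub_groups.value_counts().sort_index())
```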
def loan_sums_clage(dt_in):
    # Binning CLAGE (50-month-wide bins); work on a copy so the input is not mutated
    dt_in = dt_in.copy()
    bins = [0, 50, 100, 150, 200, 250, np.inf]
    names = ['<50', '50-100', '100-150', '150-200', '200-250', '250+']
    dt_in['CLAGE'] = pd.cut(dt_in['CLAGE'], bins, labels=names, right=False)
    dt_in['CLAGE'] = dt_in['CLAGE'].cat.reorder_categories(names, ordered=True)
    # Sum of defaulted loan amounts per bin, and each bin's share of the total
    loan_sums = dt_in.groupby("CLAGE")['LOAN'].sum().reset_index()
    loan_sums.columns = ["CLAGE", 'Loan_Sum']
    total_loan_sum = loan_sums['Loan_Sum'].sum()
    loan_sums['Percentage_of_Total'] = (loan_sums['Loan_Sum'] / total_loan_sum) * 100
    return loan_sums

loan_sums_table = loan_sums_clage(dt_default)
print(loan_sums_table)
     CLAGE  Loan_Sum  Percentage_of_Total
0      <50    845300                4.453
1   50-100   3505900               18.468
2  100-150   5909100               31.127
3  150-200   3738900               19.695
4  200-250   2861000               15.071
5     250+   2123600               11.186
dt_default=dt_outlier.copy()
dt_default=dt_default[dt_default["BAD"]==1]
print( "Total default :", dt_default["LOAN"].sum())
print( "Total loans :", dt["LOAN"].sum())
Total default : 20120400
Total loans : 110903500
#Copy the dataset to create bins and facilitate the visualisation
dt_bins2=dt_default.copy()
def loan_sums_clage_fine(dt_in):
    # Binning CLAGE into 10-month-wide bins; work on a copy so the input is not mutated
    dt_in = dt_in.copy()
    bins = list(range(0, 260, 10)) + [np.inf]
    names = ['<10'] + [f'{i}-{i + 10}' for i in range(10, 250, 10)] + ['250+']
    dt_in['CLAGE'] = pd.cut(dt_in['CLAGE'], bins, labels=names, right=False)
    dt_in['CLAGE'] = dt_in['CLAGE'].cat.reorder_categories(names, ordered=True)
    # Sum of defaulted loan amounts per bin, and each bin's share of the total
    loan_sums = dt_in.groupby("CLAGE")['LOAN'].sum().reset_index()
    loan_sums.columns = ["CLAGE", 'Loan_Sum']
    total_loan_sum = loan_sums['Loan_Sum'].sum()
    loan_sums['Percentage_of_Total'] = (loan_sums['Loan_Sum'] / total_loan_sum) * 100
    return loan_sums

loan_sums_table = loan_sums_clage_fine(dt_default)
print(loan_sums_table)
      CLAGE  Loan_Sum  Percentage_of_Total
0       <10    192500                0.957
1     10-20    149000                0.741
2     20-30    114100                0.567
3     30-40    165800                0.824
4     40-50    223900                1.113
5     50-60    370500                1.841
6     60-70    500900                2.490
7     70-80    675100                3.355
8     80-90    853100                4.240
9    90-100   1106300                5.498
10  100-110   1088200                5.408
11  110-120   1556100                7.734
12  120-130   1408400                7.000
13  130-140   1106900                5.501
14  140-150    749500                3.725
15  150-160    584700                2.906
16  160-170    743500                3.695
17  170-180   1925400                9.569
18  180-190    956100                4.752
19  190-200    665800                3.309
20  200-210    708400                3.521
21  210-220    713400                3.546
22  220-230    428700                2.131
23  230-240    666300                3.312
24  240-250    344200                1.711
25     250+   2123600               10.554
# Binning CLAGE (10-month-wide bins) on the full dataset, for the relative-risk table
bins = list(range(0, 260, 10)) + [np.inf]
names = ['<10'] + [f'{i}-{i + 10}' for i in range(10, 250, 10)] + ['250+']
dt_bins['CLAGE'] = pd.cut(dt_bins['CLAGE'], bins, labels=names, right=False)
dt_bins['CLAGE'] = dt_bins['CLAGE'].cat.reorder_categories(names, ordered=True)
key_feat('CLAGE',25)
╒═════════╤═══════════════════════════╕
│ CLAGE   │ % of defaulters (BAD=1)   │
╞═════════╪═══════════════════════════╡
│ <10     │ 58.8235                   │
│ 20-30   │ 52.6316                   │
│ 30-40   │ 45.8333                   │
│ 10-20   │ 45                        │
│ 40-50   │ 42.8571                   │
│ 50-60   │ 37.6812                   │
│ 60-70   │ 31.3559                   │
│ 90-100  │ 27.6119                   │
│ 70-80   │ 27.5                      │
│ 80-90   │ 27.1605                   │
│ 130-140 │ 26.8595                   │
│ 120-130 │ 26.5363                   │
│ 140-150 │ 25.7895                   │
│ 100-110 │ 24.9097                   │
│ 110-120 │ 22.9102                   │
│ 160-170 │ 20.398                    │
│ 150-160 │ 19.7674                   │
│ 210-220 │ 16.8269                   │
│ 170-180 │ 16.6124                   │
│ 180-190 │ 16.4234                   │
│ 230-240 │ 15.6425                   │
│ 200-210 │ 15.1751                   │
│ 220-230 │ 14.8571                   │
│ 190-200 │ 12.9921                   │
│ 250+    │ 10.1382                   │
╘═════════╧═══════════════════════════╛
Observation - ABSOLUTE RISK: there is a clear increase in risk from 50 to 200 months, with a peak on the 100-150 range at 31.127% (versus 18.468% for 50-100 and 19.695% for 150-200). Increasing the level of detail to reach a more granular view:
In comparison, the highest risks identified in absolute terms represent lower relative risk levels, as the higher absolute risk is driven by larger total amounts borrowed: